This dataset contains data on bike rentals in the San Francisco Bay Area.
The data source is an SQLite3 database.
The dataset offers plenty of potential for analysis. To start with, though, it is enough to practice and deepen the skills and knowledge you have acquired so far.
Task/goal:
- Familiarize yourself with the data.
- Connect to the database and explore the metadata (which tables with which columns are present? etc.).
- What insights can you draw? Feel free to visualize them with a few plots; it does not matter which library you use.
You may explore whatever you like. The goal is for you to see how much you are already capable of!
What do the columns stand for?
Table: Status
-station_id: station ID number
-bikes_available: number of available bikes
-docks_available: number of available docks
-time: date and time, PST
Table: Station Information
-station_id: station ID number (corresponds to "station_id")
-name: name of station
-lat: latitude
-long: longitude
-dockcount: number of total docks at station
-landmark: city (San Francisco, Redwood City, Palo Alto, Mountain View, San Jose)
-installation: original date that station was installed. If station was moved, it is noted below.
Table: TRIP DATA
-Trip ID: numeric ID of bike trip
-Duration: time of trip in seconds
-Start Date: start date of trip with date and time, in PST
-Start Station: station name of start station
-Start Terminal: numeric reference for start station
-End Date: end date of trip with date and time, in PST
-End Station: station name for end station
-End Terminal: numeric reference for end station
-Bike #: ID of bike used
-Subscription Type: Subscriber = annual or 30-day member; Customer = 24-hour or 3-day member
-Zip Code: Home zip code of subscriber (customers can choose to manually enter zip at kiosk however data is unreliable)
Table: WEATHER DATA
Daily weather information per service area, provided from Weather Underground in PST. Weather is listed from north to south (San Francisco, Redwood City, Palo Alto, Mountain View, San Jose).
# import the required libraries
import sqlite3
import pandas as pd
import numpy as np
# connect to the database
conn = sqlite3.connect('rental_bike.sqlite')
# query
my_query = """
select name
from sqlite_master
where type = 'table'
"""
res_df = pd.read_sql(my_query, conn)
res_df
| name | |
|---|---|
| 0 | station |
| 1 | status |
| 2 | trip |
| 3 | weather |
my_query = """
PRAGMA table_info(station)
"""
res_df = pd.read_sql(my_query, conn)
res_df
| cid | name | type | notnull | dflt_value | pk | |
|---|---|---|---|---|---|---|
| 0 | 0 | id | INTEGER | 0 | None | 1 |
| 1 | 1 | name | TEXT | 0 | None | 0 |
| 2 | 2 | lat | NUMERIC | 0 | None | 0 |
| 3 | 3 | long | NUMERIC | 0 | None | 0 |
| 4 | 4 | dock_count | INTEGER | 0 | None | 0 |
| 5 | 5 | city | TEXT | 0 | None | 0 |
| 6 | 6 | installation_date | TEXT | 0 | None | 0 |
!!! Note !!!: column 'installation_date' is stored as TEXT.
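Since 'installation_date' is plain text in M/D/YYYY form, it can be parsed explicitly after loading. A minimal sketch using sample values from the table output above:

```python
import pandas as pd

# 'installation_date' is stored as TEXT in M/D/YYYY form (e.g. "8/6/2013");
# passing an explicit format avoids pandas having to guess it:
dates = pd.Series(["8/6/2013", "8/5/2013", "8/7/2013"])
parsed = pd.to_datetime(dates, format="%m/%d/%Y")
print(parsed.iloc[0])  # 2013-08-06 00:00:00
```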
my_query = """
select count(1)
from station
"""
res_df = pd.read_sql(my_query, conn)
res_df
| count(1) | |
|---|---|
| 0 | 70 |
my_query = """
select city, count(1)
from station
group by city
"""
res_df = pd.read_sql(my_query, conn)
res_df
| city | count(1) | |
|---|---|---|
| 0 | Mountain View | 7 |
| 1 | Palo Alto | 5 |
| 2 | Redwood City | 7 |
| 3 | San Francisco | 35 |
| 4 | San Jose | 16 |
-- The table 'station' contains information on the 70 stations in 5 cities in California.
my_query = """
select *
from station
"""
# limit 10
station_info_df = pd.read_sql(my_query, conn)
station_info_df.head()
| id | name | lat | long | dock_count | city | installation_date | |
|---|---|---|---|---|---|---|---|
| 0 | 2 | San Jose Diridon Caltrain Station | 37.329732 | -121.901782 | 27 | San Jose | 8/6/2013 |
| 1 | 3 | San Jose Civic Center | 37.330698 | -121.888979 | 15 | San Jose | 8/5/2013 |
| 2 | 4 | Santa Clara at Almaden | 37.333988 | -121.894902 | 11 | San Jose | 8/6/2013 |
| 3 | 5 | Adobe on Almaden | 37.331415 | -121.893200 | 19 | San Jose | 8/5/2013 |
| 4 | 6 | San Pedro Square | 37.336721 | -121.894074 | 15 | San Jose | 8/7/2013 |
my_query = """
PRAGMA table_info(status)
"""
res_df = pd.read_sql(my_query, conn)
res_df
| cid | name | type | notnull | dflt_value | pk | |
|---|---|---|---|---|---|---|
| 0 | 0 | station_id | INTEGER | 0 | None | 0 |
| 1 | 1 | bikes_available | INTEGER | 0 | None | 0 |
| 2 | 2 | docks_available | INTEGER | 0 | None | 0 |
| 3 | 3 | time | TEXT | 0 | None | 0 |
!!! Note !!!: column 'time' is stored as TEXT.
my_query = """
select count(1)
from status
"""
res_df = pd.read_sql(my_query, conn)
res_df
| count(1) | |
|---|---|
| 0 | 71984434 |
my_query = """
select *
from status
limit 10
"""
res_df = pd.read_sql(my_query, conn)
res_df
| station_id | bikes_available | docks_available | time | |
|---|---|---|---|---|
| 0 | 2 | 2 | 25 | 2013/08/29 12:06:01 |
| 1 | 2 | 2 | 25 | 2013/08/29 12:07:01 |
| 2 | 2 | 2 | 25 | 2013/08/29 12:08:01 |
| 3 | 2 | 2 | 25 | 2013/08/29 12:09:01 |
| 4 | 2 | 2 | 25 | 2013/08/29 12:10:01 |
| 5 | 2 | 2 | 25 | 2013/08/29 12:11:01 |
| 6 | 2 | 2 | 25 | 2013/08/29 12:12:01 |
| 7 | 2 | 2 | 25 | 2013/08/29 12:13:01 |
| 8 | 2 | 2 | 25 | 2013/08/29 12:15:01 |
| 9 | 2 | 2 | 25 | 2013/08/29 12:16:02 |
The status of the stations was recorded in the table 'status' once per minute.
time format: 2013/08/29 12:06:01 -> YYYY/MM/DD HH:MM:SS
my_query = """
select distinct station_id
from status
"""
res_df = pd.read_sql(my_query, conn)
res_df
| station_id | |
|---|---|
| 0 | 2 |
| 1 | 3 |
| 2 | 4 |
| 3 | 5 |
| 4 | 6 |
| ... | ... |
| 65 | 77 |
| 66 | 80 |
| 67 | 82 |
| 68 | 83 |
| 69 | 84 |
70 rows × 1 columns
my_query = """
select max(time)
from status
"""
res_df = pd.read_sql(my_query, conn)
res_df
| max(time) | |
|---|---|
| 0 | 2015-08-31 23:59:02 |
my_query = """
select min(time)
from status
"""
res_df = pd.read_sql(my_query, conn)
res_df
| min(time) | |
|---|---|
| 0 | 2013/08/29 12:06:01 |
-- The table 'status' covers the period from 2013/08/29 12:06:01 to 2015-08-31 23:59:02. Note that two different timestamp formats are in use (slashes early on, dashes later).
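The min/max outputs above reveal two different text formats in 'time' ("2013/08/29 …" vs. "2015-08-31 …"). A minimal sketch of normalizing both into one parseable format:

```python
import pandas as pd

# 'time' mixes "2013/08/29 12:06:01" and "2015-08-31 23:59:02"; replacing the
# slashes yields a single ISO-like format that parses with an explicit format:
times = pd.Series(["2013/08/29 12:06:01", "2015-08-31 23:59:02"])
normalized = pd.to_datetime(times.str.replace("/", "-", regex=False),
                            format="%Y-%m-%d %H:%M:%S")
print(normalized.min())  # 2013-08-29 12:06:01
```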
# my_query = """
# select *
# from status
# where station_id in (select id from station where city = 'San Francisco')
# and substr(time, 1, 4)||substr(time, 6, 2)||substr(time, 9, 2)||substr(time, 12, 2)||substr(time, 15, 2) >= '201404140000'
# and substr(time, 1, 4)||substr(time, 6, 2)||substr(time, 9, 2)||substr(time, 12, 2)||substr(time, 15, 2) <= '201404202359'
# """
# res_df = pd.read_sql(my_query, conn)
# res_df
# # and time >= '2014/04/14 00:00:00'
# # and time <= '2014/04/19 23:59:59'
# df_sf_14_19_april_2014 = res_df.copy()
my_query = """
select *
from status
where station_id = 70
"""
station_70_df = pd.read_sql(my_query, conn)
station_70_df
| station_id | bikes_available | docks_available | time | |
|---|---|---|---|---|
| 0 | 70 | 0 | 19 | 2013/08/29 12:06:01 |
| 1 | 70 | 0 | 19 | 2013/08/29 12:07:01 |
| 2 | 70 | 0 | 19 | 2013/08/29 12:08:01 |
| 3 | 70 | 0 | 19 | 2013/08/29 12:09:01 |
| 4 | 70 | 0 | 19 | 2013/08/29 12:10:01 |
| ... | ... | ... | ... | ... |
| 1047135 | 70 | 15 | 4 | 2015-08-31 23:55:02 |
| 1047136 | 70 | 15 | 4 | 2015-08-31 23:56:01 |
| 1047137 | 70 | 15 | 4 | 2015-08-31 23:57:02 |
| 1047138 | 70 | 15 | 4 | 2015-08-31 23:58:02 |
| 1047139 | 70 | 15 | 4 | 2015-08-31 23:59:02 |
1047140 rows × 4 columns
my_query = """
select *
from status
where bikes_available = 0 or docks_available = 0
"""
no_bikes_or_docks_df = pd.read_sql(my_query, conn)
no_bikes_or_docks_df
| station_id | bikes_available | docks_available | time | |
|---|---|---|---|---|
| 0 | 2 | 0 | 27 | 2013/11/03 02:00:01 |
| 1 | 2 | 0 | 27 | 2013/11/03 02:01:01 |
| 2 | 2 | 0 | 27 | 2013/11/03 02:02:01 |
| 3 | 2 | 0 | 27 | 2013/11/03 02:03:01 |
| 4 | 2 | 0 | 27 | 2013/12/04 10:02:03 |
| ... | ... | ... | ... | ... |
| 855438 | 84 | 0 | 15 | 2015-08-27 16:26:02 |
| 855439 | 84 | 0 | 15 | 2015-08-27 16:27:02 |
| 855440 | 84 | 0 | 15 | 2015-08-27 16:28:02 |
| 855441 | 84 | 0 | 15 | 2015-08-27 16:29:02 |
| 855442 | 84 | 0 | 15 | 2015-08-27 16:30:02 |
855443 rows × 4 columns
my_query = """
select *
from status
where bikes_available = 1 or docks_available = 1
"""
critical_bikes_or_docks_df = pd.read_sql(my_query, conn)
critical_bikes_or_docks_df
| station_id | bikes_available | docks_available | time | |
|---|---|---|---|---|
| 0 | 2 | 1 | 26 | 2013/08/29 20:38:01 |
| 1 | 2 | 1 | 26 | 2013/08/29 20:39:01 |
| 2 | 2 | 1 | 26 | 2013/08/29 20:40:02 |
| 3 | 2 | 1 | 26 | 2013/08/29 20:41:01 |
| 4 | 2 | 1 | 26 | 2013/08/29 20:42:01 |
| ... | ... | ... | ... | ... |
| 1755764 | 84 | 1 | 14 | 2015-08-27 13:02:02 |
| 1755765 | 84 | 1 | 14 | 2015-08-27 13:03:02 |
| 1755766 | 84 | 1 | 14 | 2015-08-27 13:04:02 |
| 1755767 | 84 | 1 | 14 | 2015-08-27 13:05:02 |
| 1755768 | 84 | 1 | 14 | 2015-08-27 16:31:02 |
1755769 rows × 4 columns
my_query = """
select *
from status
where station_id in (select id from station where city = 'San Francisco')
and substr(time, 1, 4)||substr(time, 6, 2)||substr(time, 9, 2)||substr(time, 12, 2)||substr(time, 15, 2) >= '201404160000'
and substr(time, 1, 4)||substr(time, 6, 2)||substr(time, 9, 2)||substr(time, 12, 2)||substr(time, 15, 2) <= '201404162359'
"""
df_sf_16_april_2014 = pd.read_sql(my_query, conn)
df_sf_16_april_2014
| station_id | bikes_available | docks_available | time | |
|---|---|---|---|---|
| 0 | 39 | 14 | 5 | 2014-04-16 00:00:03 |
| 1 | 39 | 14 | 5 | 2014-04-16 00:01:03 |
| 2 | 39 | 14 | 5 | 2014-04-16 00:02:03 |
| 3 | 39 | 14 | 5 | 2014-04-16 00:03:02 |
| 4 | 39 | 14 | 5 | 2014-04-16 00:04:02 |
| ... | ... | ... | ... | ... |
| 50395 | 82 | 6 | 9 | 2014-04-16 23:55:03 |
| 50396 | 82 | 6 | 9 | 2014-04-16 23:56:02 |
| 50397 | 82 | 6 | 9 | 2014-04-16 23:57:03 |
| 50398 | 82 | 6 | 9 | 2014-04-16 23:58:03 |
| 50399 | 82 | 6 | 9 | 2014-04-16 23:59:03 |
50400 rows × 4 columns
my_query = """ SELECT
station_id,
avg(bikes_available) as avg_bikes_available,
avg(docks_available) as avg_docks_available,
strftime('%w',substr(time, 1, 4)||'-'||substr(time, 6, 2)||'-'||substr(time, 9, 2)||' '||substr(time, 12, 2)||':'||substr(time, 15, 2)||':'||'00') as day_of_week,
substr(time, 12, 2) as h
FROM status
GROUP BY station_id, day_of_week, h
"""
station_status_df = pd.read_sql(my_query,con=conn)
station_status_df
| station_id | avg_bikes_available | avg_docks_available | day_of_week | h | |
|---|---|---|---|---|---|
| 0 | 2 | 13.675880 | 13.295195 | 0 | 00 |
| 1 | 2 | 13.610708 | 13.360264 | 0 | 01 |
| 2 | 2 | 13.417803 | 13.552636 | 0 | 02 |
| 3 | 2 | 13.505408 | 13.465537 | 0 | 03 |
| 4 | 2 | 13.496455 | 13.474541 | 0 | 04 |
| ... | ... | ... | ... | ... | ... |
| 11755 | 84 | 7.027626 | 7.958676 | 6 | 19 |
| 11756 | 84 | 7.094770 | 7.891528 | 6 | 20 |
| 11757 | 84 | 7.171005 | 7.815297 | 6 | 21 |
| 11758 | 84 | 7.242466 | 7.743836 | 6 | 22 |
| 11759 | 84 | 7.290183 | 7.696119 | 6 | 23 |
11760 rows × 5 columns
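SQLite's strftime('%w', …) only understands ISO-formatted timestamps, which is why the query above rebuilds the slash-separated strings with substr and ||. A small in-memory check of that same expression (%w returns the weekday with 0 = Sunday):

```python
import sqlite3

# verify the substr/|| rebuild on one sample 'time' value from the status table
conn_mem = sqlite3.connect(":memory:")
expr = ("strftime('%w', substr(t,1,4)||'-'||substr(t,6,2)||'-'||substr(t,9,2)"
        "||' '||substr(t,12,2)||':'||substr(t,15,2)||':00')")
dow = conn_mem.execute(
    f"select {expr} from (select '2013/08/29 12:06:01' as t)"
).fetchone()[0]
print(dow)  # '4' -> 2013-08-29 was a Thursday
```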
my_query = """
PRAGMA table_info(trip)
"""
res_df = pd.read_sql(my_query, conn)
res_df
| cid | name | type | notnull | dflt_value | pk | |
|---|---|---|---|---|---|---|
| 0 | 0 | id | INTEGER | 0 | None | 1 |
| 1 | 1 | duration | INTEGER | 0 | None | 0 |
| 2 | 2 | start_date | TEXT | 0 | None | 0 |
| 3 | 3 | start_station_name | TEXT | 0 | None | 0 |
| 4 | 4 | start_station_id | INTEGER | 0 | None | 0 |
| 5 | 5 | end_date | TEXT | 0 | None | 0 |
| 6 | 6 | end_station_name | TEXT | 0 | None | 0 |
| 7 | 7 | end_station_id | INTEGER | 0 | None | 0 |
| 8 | 8 | bike_id | INTEGER | 0 | None | 0 |
| 9 | 9 | subscription_type | TEXT | 0 | None | 0 |
| 10 | 10 | zip_code | INTEGER | 0 | None | 0 |
!!! Note !!!: the date columns 'start_date' and 'end_date' are stored as TEXT.
my_query = """
select count(1)
from trip
"""
res_df = pd.read_sql(my_query, conn)
res_df
| count(1) | |
|---|---|
| 0 | 669959 |
-- The table contains information on almost 700,000 trips.
my_query = """
select subscription_type, count(1)
from trip
group by subscription_type
"""
res_df = pd.read_sql(my_query, conn)
res_df
| subscription_type | count(1) | |
|---|---|---|
| 0 | Customer | 103213 |
| 1 | Subscriber | 566746 |
my_query = """
select *
from trip
limit 10
"""
res_df = pd.read_sql(my_query, conn)
res_df
| id | duration | start_date | start_station_name | start_station_id | end_date | end_station_name | end_station_id | bike_id | subscription_type | zip_code | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4069 | 174 | 8/29/2013 9:08 | 2nd at South Park | 64 | 8/29/2013 9:11 | 2nd at South Park | 64 | 288 | Subscriber | 94114 |
| 1 | 4073 | 1067 | 8/29/2013 9:24 | South Van Ness at Market | 66 | 8/29/2013 9:42 | San Francisco Caltrain 2 (330 Townsend) | 69 | 321 | Subscriber | 94703 |
| 2 | 4074 | 1131 | 8/29/2013 9:24 | South Van Ness at Market | 66 | 8/29/2013 9:43 | San Francisco Caltrain 2 (330 Townsend) | 69 | 317 | Subscriber | 94115 |
| 3 | 4075 | 1117 | 8/29/2013 9:24 | South Van Ness at Market | 66 | 8/29/2013 9:43 | San Francisco Caltrain 2 (330 Townsend) | 69 | 316 | Subscriber | 94122 |
| 4 | 4076 | 1118 | 8/29/2013 9:25 | South Van Ness at Market | 66 | 8/29/2013 9:43 | San Francisco Caltrain 2 (330 Townsend) | 69 | 322 | Subscriber | 94597 |
| 5 | 4078 | 3829 | 8/29/2013 9:31 | Redwood City Caltrain Station | 22 | 8/29/2013 10:34 | Redwood City Caltrain Station | 22 | 228 | Customer | 94062 |
| 6 | 4079 | 995 | 8/29/2013 9:35 | South Van Ness at Market | 66 | 8/29/2013 9:52 | South Van Ness at Market | 66 | 327 | Subscriber | 94102 |
| 7 | 4080 | 764 | 8/29/2013 9:36 | South Van Ness at Market | 66 | 8/29/2013 9:49 | San Francisco Caltrain 2 (330 Townsend) | 69 | 315 | Subscriber | 94117 |
| 8 | 4081 | 218 | 8/29/2013 9:38 | Mountain View City Hall | 27 | 8/29/2013 9:41 | Mountain View City Hall | 27 | 150 | Subscriber | 97214 |
| 9 | 4084 | 287 | 8/29/2013 9:41 | Mountain View City Hall | 27 | 8/29/2013 9:46 | Mountain View City Hall | 27 | 138 | Customer | 97214 |
my_query = """
select *
from trip
"""
trip_customer_df = pd.read_sql(my_query, conn)
trip_customer_df.head(5)
| id | duration | start_date | start_station_name | start_station_id | end_date | end_station_name | end_station_id | bike_id | subscription_type | zip_code | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4069 | 174 | 8/29/2013 9:08 | 2nd at South Park | 64 | 8/29/2013 9:11 | 2nd at South Park | 64 | 288 | Subscriber | 94114 |
| 1 | 4073 | 1067 | 8/29/2013 9:24 | South Van Ness at Market | 66 | 8/29/2013 9:42 | San Francisco Caltrain 2 (330 Townsend) | 69 | 321 | Subscriber | 94703 |
| 2 | 4074 | 1131 | 8/29/2013 9:24 | South Van Ness at Market | 66 | 8/29/2013 9:43 | San Francisco Caltrain 2 (330 Townsend) | 69 | 317 | Subscriber | 94115 |
| 3 | 4075 | 1117 | 8/29/2013 9:24 | South Van Ness at Market | 66 | 8/29/2013 9:43 | San Francisco Caltrain 2 (330 Townsend) | 69 | 316 | Subscriber | 94122 |
| 4 | 4076 | 1118 | 8/29/2013 9:25 | South Van Ness at Market | 66 | 8/29/2013 9:43 | San Francisco Caltrain 2 (330 Townsend) | 69 | 322 | Subscriber | 94597 |
trip_customer_df.shape
(669959, 11)
my_query = """
select count(1)
from
(select distinct bike_id
from trip)
"""
res_df = pd.read_sql(my_query, conn)
res_df
| count(1) | |
|---|---|
| 0 | 700 |
-- A fleet of 700 bikes is available for rental.
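The nested subquery can also be written more directly with count(distinct …). A tiny in-memory check that both forms agree (the table and values here are illustrative, not from the rental database):

```python
import sqlite3

conn_mem = sqlite3.connect(":memory:")
conn_mem.execute("create table trip (bike_id integer)")
conn_mem.executemany("insert into trip values (?)", [(288,), (321,), (288,)])
via_subquery = conn_mem.execute(
    "select count(1) from (select distinct bike_id from trip)").fetchone()[0]
via_distinct = conn_mem.execute(
    "select count(distinct bike_id) from trip").fetchone()[0]
print(via_subquery, via_distinct)  # 2 2
```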
my_query = """
PRAGMA table_info(weather)
"""
res_df = pd.read_sql(my_query, conn)
res_df
| cid | name | type | notnull | dflt_value | pk | |
|---|---|---|---|---|---|---|
| 0 | 0 | date | TEXT | 0 | None | 0 |
| 1 | 1 | max_temperature_f | INTEGER | 0 | None | 0 |
| 2 | 2 | mean_temperature_f | INTEGER | 0 | None | 0 |
| 3 | 3 | min_temperature_f | INTEGER | 0 | None | 0 |
| 4 | 4 | max_dew_point_f | INTEGER | 0 | None | 0 |
| 5 | 5 | mean_dew_point_f | INTEGER | 0 | None | 0 |
| 6 | 6 | min_dew_point_f | INTEGER | 0 | None | 0 |
| 7 | 7 | max_humidity | INTEGER | 0 | None | 0 |
| 8 | 8 | mean_humidity | INTEGER | 0 | None | 0 |
| 9 | 9 | min_humidity | INTEGER | 0 | None | 0 |
| 10 | 10 | max_sea_level_pressure_inches | NUMERIC | 0 | None | 0 |
| 11 | 11 | mean_sea_level_pressure_inches | NUMERIC | 0 | None | 0 |
| 12 | 12 | min_sea_level_pressure_inches | NUMERIC | 0 | None | 0 |
| 13 | 13 | max_visibility_miles | INTEGER | 0 | None | 0 |
| 14 | 14 | mean_visibility_miles | INTEGER | 0 | None | 0 |
| 15 | 15 | min_visibility_miles | INTEGER | 0 | None | 0 |
| 16 | 16 | max_wind_Speed_mph | INTEGER | 0 | None | 0 |
| 17 | 17 | mean_wind_speed_mph | INTEGER | 0 | None | 0 |
| 18 | 18 | max_gust_speed_mph | INTEGER | 0 | None | 0 |
| 19 | 19 | precipitation_inches | INTEGER | 0 | None | 0 |
| 20 | 20 | cloud_cover | INTEGER | 0 | None | 0 |
| 21 | 21 | events | TEXT | 0 | None | 0 |
| 22 | 22 | wind_dir_degrees | INTEGER | 0 | None | 0 |
| 23 | 23 | zip_code | INTEGER | 0 | None | 0 |
my_query = """
select *
from weather
limit 10
"""
res_df = pd.read_sql(my_query, conn)
res_df
| date | max_temperature_f | mean_temperature_f | min_temperature_f | max_dew_point_f | mean_dew_point_f | min_dew_point_f | max_humidity | mean_humidity | min_humidity | ... | mean_visibility_miles | min_visibility_miles | max_wind_Speed_mph | mean_wind_speed_mph | max_gust_speed_mph | precipitation_inches | cloud_cover | events | wind_dir_degrees | zip_code | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8/29/2013 | 74 | 68 | 61 | 61 | 58 | 56 | 93 | 75 | 57 | ... | 10 | 10 | 23 | 11 | 28 | 0 | 4 | 286 | 94107 | |
| 1 | 8/30/2013 | 78 | 69 | 60 | 61 | 58 | 56 | 90 | 70 | 50 | ... | 10 | 7 | 29 | 13 | 35 | 0 | 2 | 291 | 94107 | |
| 2 | 8/31/2013 | 71 | 64 | 57 | 57 | 56 | 54 | 93 | 75 | 57 | ... | 10 | 10 | 26 | 15 | 31 | 0 | 4 | 284 | 94107 | |
| 3 | 9/1/2013 | 74 | 66 | 58 | 60 | 56 | 53 | 87 | 68 | 49 | ... | 10 | 10 | 25 | 13 | 29 | 0 | 4 | 284 | 94107 | |
| 4 | 9/2/2013 | 75 | 69 | 62 | 61 | 60 | 58 | 93 | 77 | 61 | ... | 10 | 6 | 23 | 12 | 30 | 0 | 6 | 277 | 94107 | |
| 5 | 9/3/2013 | 73 | 67 | 60 | 59 | 56 | 51 | 84 | 65 | 46 | ... | 10 | 10 | 24 | 15 | 31 | 0 | 2 | 276 | 94107 | |
| 6 | 9/4/2013 | 74 | 68 | 61 | 59 | 57 | 56 | 90 | 72 | 53 | ... | 10 | 10 | 29 | 19 | 35 | 0 | 4 | 269 | 94107 | |
| 7 | 9/5/2013 | 72 | 66 | 60 | 57 | 56 | 54 | 90 | 74 | 57 | ... | 10 | 10 | 31 | 21 | 37 | 0 | 3 | 270 | 94107 | |
| 8 | 9/6/2013 | 85 | 71 | 56 | 57 | 51 | 45 | 86 | 58 | 29 | ... | 10 | 10 | 24 | 8 | 28 | 0 | 0 | 287 | 94107 | |
| 9 | 9/7/2013 | 88 | 73 | 58 | 64 | 54 | 46 | 86 | 59 | 31 | ... | 10 | 10 | 21 | 8 | 25 | 0 | 1 | 305 | 94107 |
10 rows × 24 columns
# station_info_df
# trip_customer_df
# no_bikes_df
# full_station_df
station_info_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 70 entries, 0 to 69 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 70 non-null int64 1 name 70 non-null object 2 lat 70 non-null float64 3 long 70 non-null float64 4 dock_count 70 non-null int64 5 city 70 non-null object 6 installation_date 70 non-null object dtypes: float64(2), int64(2), object(3) memory usage: 4.0+ KB
station_info_df.loc[:, ['city', 'lat', 'long']].head()
| city | lat | long | |
|---|---|---|---|
| 0 | San Jose | 37.329732 | -121.901782 |
| 1 | San Jose | 37.330698 | -121.888979 |
| 2 | San Jose | 37.333988 | -121.894902 |
| 3 | San Jose | 37.331415 | -121.893200 |
| 4 | San Jose | 37.336721 | -121.894074 |
station_info_df.shape
(70, 7)
station_info_df.loc[:, 'id']
0 2
1 3
2 4
3 5
4 6
..
65 77
66 80
67 82
68 83
69 84
Name: id, Length: 70, dtype: int64
stations_trip = trip_customer_df.loc[:, 'start_station_id'].unique()
stations_trip
array([64, 66, 22, 27, 45, 35, 73, 75, 69, 76, 10, 49, 56, 70, 48, 67, 39,
58, 61, 50, 47, 41, 74, 37, 42, 9, 55, 65, 77, 51, 68, 14, 34, 36,
72, 11, 60, 62, 12, 46, 57, 5, 71, 6, 4, 28, 3, 8, 63, 7, 2,
54, 59, 29, 13, 30, 23, 26, 21, 33, 38, 24, 16, 25, 80, 32, 31, 82,
83, 84], dtype=int64)
len(stations_trip)
70
trip_customer_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 669959 entries, 0 to 669958 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 669959 non-null int64 1 duration 669959 non-null int64 2 start_date 669959 non-null object 3 start_station_name 669959 non-null object 4 start_station_id 669959 non-null int64 5 end_date 669959 non-null object 6 end_station_name 669959 non-null object 7 end_station_id 669959 non-null int64 8 bike_id 669959 non-null int64 9 subscription_type 669959 non-null object 10 zip_code 669959 non-null object dtypes: int64(5), object(6) memory usage: 56.2+ MB
trip_customer_df = trip_customer_df.merge(station_info_df.loc[:, ['id', 'city', 'lat', 'long']], left_on='start_station_id', right_on='id',
suffixes=('', '_start'))
trip_customer_df.shape
(669959, 15)
trip_customer_df.drop(columns=['id_start'], inplace=True)
trip_customer_df.rename(columns={"city" : "city_start", 'lat': 'lat_start', 'long': 'long_start'}, inplace=True)
trip_customer_df.head(5)
| id | duration | start_date | start_station_name | start_station_id | end_date | end_station_name | end_station_id | bike_id | subscription_type | zip_code | city_start | lat_start | long_start | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4069 | 174 | 8/29/2013 9:08 | 2nd at South Park | 64 | 8/29/2013 9:11 | 2nd at South Park | 64 | 288 | Subscriber | 94114 | San Francisco | 37.782259 | -122.392738 |
| 1 | 4426 | 982 | 8/29/2013 12:52 | 2nd at South Park | 64 | 8/29/2013 13:08 | 2nd at South Park | 64 | 593 | Subscriber | 94117 | San Francisco | 37.782259 | -122.392738 |
| 2 | 4465 | 419 | 8/29/2013 13:11 | 2nd at South Park | 64 | 8/29/2013 13:18 | Post at Kearney | 47 | 498 | Subscriber | 94117 | San Francisco | 37.782259 | -122.392738 |
| 3 | 4503 | 728 | 8/29/2013 13:29 | 2nd at South Park | 64 | 8/29/2013 13:41 | Embarcadero at Vallejo | 48 | 331 | Subscriber | 94110 | San Francisco | 37.782259 | -122.392738 |
| 4 | 4557 | 130 | 8/29/2013 13:57 | 2nd at South Park | 64 | 8/29/2013 13:59 | 2nd at South Park | 64 | 371 | Subscriber | 94122 | San Francisco | 37.782259 | -122.392738 |
trip_customer_df = trip_customer_df.merge(station_info_df.loc[:, ['id', 'city', 'lat', 'long']], left_on='end_station_id', right_on='id',
suffixes=('', '_end'))
trip_customer_df.drop(columns=['id_end'], inplace=True)
trip_customer_df.rename(columns={"city" : "city_end", 'lat': 'lat_end', 'long': 'long_end'}, inplace=True)
trip_customer_df.head(5)
| id | duration | start_date | start_station_name | start_station_id | end_date | end_station_name | end_station_id | bike_id | subscription_type | zip_code | city_start | lat_start | long_start | city_end | lat_end | long_end | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4069 | 174 | 8/29/2013 9:08 | 2nd at South Park | 64 | 8/29/2013 9:11 | 2nd at South Park | 64 | 288 | Subscriber | 94114 | San Francisco | 37.782259 | -122.392738 | San Francisco | 37.782259 | -122.392738 |
| 1 | 4426 | 982 | 8/29/2013 12:52 | 2nd at South Park | 64 | 8/29/2013 13:08 | 2nd at South Park | 64 | 593 | Subscriber | 94117 | San Francisco | 37.782259 | -122.392738 | San Francisco | 37.782259 | -122.392738 |
| 2 | 4557 | 130 | 8/29/2013 13:57 | 2nd at South Park | 64 | 8/29/2013 13:59 | 2nd at South Park | 64 | 371 | Subscriber | 94122 | San Francisco | 37.782259 | -122.392738 | San Francisco | 37.782259 | -122.392738 |
| 3 | 4564 | 1693 | 8/29/2013 14:04 | 2nd at South Park | 64 | 8/29/2013 14:32 | 2nd at South Park | 64 | 272 | Subscriber | 94005 | San Francisco | 37.782259 | -122.392738 | San Francisco | 37.782259 | -122.392738 |
| 4 | 4566 | 1699 | 8/29/2013 14:04 | 2nd at South Park | 64 | 8/29/2013 14:33 | 2nd at South Park | 64 | 371 | Subscriber | 94122 | San Francisco | 37.782259 | -122.392738 | San Francisco | 37.782259 | -122.392738 |
trip_customer_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 669959 entries, 0 to 669958 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 669959 non-null int64 1 duration 669959 non-null int64 2 start_date 669959 non-null object 3 start_station_name 669959 non-null object 4 start_station_id 669959 non-null int64 5 end_date 669959 non-null object 6 end_station_name 669959 non-null object 7 end_station_id 669959 non-null int64 8 bike_id 669959 non-null int64 9 subscription_type 669959 non-null object 10 zip_code 669959 non-null object 11 city_start 669959 non-null object 12 lat_start 669959 non-null float64 13 long_start 669959 non-null float64 14 city_end 669959 non-null object 15 lat_end 669959 non-null float64 16 long_end 669959 non-null float64 dtypes: float64(4), int64(5), object(8) memory usage: 92.0+ MB
trip_customer_df.isnull().sum()
id 0 duration 0 start_date 0 start_station_name 0 start_station_id 0 end_date 0 end_station_name 0 end_station_id 0 bike_id 0 subscription_type 0 zip_code 0 city_start 0 lat_start 0 long_start 0 city_end 0 lat_end 0 long_end 0 dtype: int64
trip_customer_df.loc[:, ['start_date', 'end_date']].head(5)
| start_date | end_date | |
|---|---|---|
| 0 | 8/29/2013 9:08 | 8/29/2013 9:11 |
| 1 | 8/29/2013 12:52 | 8/29/2013 13:08 |
| 2 | 8/29/2013 13:57 | 8/29/2013 13:59 |
| 3 | 8/29/2013 14:04 | 8/29/2013 14:32 |
| 4 | 8/29/2013 14:04 | 8/29/2013 14:33 |
trip_customer_df.loc[:, 'start_date'] = pd.to_datetime(trip_customer_df.loc[:, 'start_date'])
trip_customer_df.loc[:, 'end_date'] = pd.to_datetime(trip_customer_df.loc[:, 'end_date'])
trip_customer_df.dtypes
id int64 duration int64 start_date datetime64[ns] start_station_name object start_station_id int64 end_date datetime64[ns] end_station_name object end_station_id int64 bike_id int64 subscription_type object zip_code object city_start object lat_start float64 long_start float64 city_end object lat_end float64 long_end float64 dtype: object
trip_customer_df.loc[:, ['start_date', 'end_date']].head(5)
| start_date | end_date | |
|---|---|---|
| 0 | 2013-08-29 09:08:00 | 2013-08-29 09:11:00 |
| 1 | 2013-08-29 12:52:00 | 2013-08-29 13:08:00 |
| 2 | 2013-08-29 13:57:00 | 2013-08-29 13:59:00 |
| 3 | 2013-08-29 14:04:00 | 2013-08-29 14:32:00 |
| 4 | 2013-08-29 14:04:00 | 2013-08-29 14:33:00 |
trip_customer_df.loc[:, 'subscription_type'] = trip_customer_df.loc[:, 'subscription_type'].astype("category")
trip_customer_df.dtypes
id int64 duration int64 start_date datetime64[ns] start_station_name object start_station_id int64 end_date datetime64[ns] end_station_name object end_station_id int64 bike_id int64 subscription_type category zip_code object city_start object lat_start float64 long_start float64 city_end object lat_end float64 long_end float64 dtype: object
trip_customer_df.loc[:,["id","duration","start_station_id", "end_station_id", "bike_id"]] = trip_customer_df.loc[:,["id","duration","start_station_id", "end_station_id", "bike_id"]].apply(pd.to_numeric, downcast="integer")
trip_customer_df.dtypes
id int32 duration int32 start_date datetime64[ns] start_station_name object start_station_id int8 end_date datetime64[ns] end_station_name object end_station_id int8 bike_id int16 subscription_type category zip_code object city_start object lat_start float64 long_start float64 city_end object lat_end float64 long_end float64 dtype: object
# trip_customer_df.loc[:, 'zip_code'].astype("int")
trip_customer_df.replace({"zip_code":''}, np.nan, inplace=True )
trip_customer_df.isnull().sum()
id 0 duration 0 start_date 0 start_station_name 0 start_station_id 0 end_date 0 end_station_name 0 end_station_id 0 bike_id 0 subscription_type 0 zip_code 6619 city_start 0 lat_start 0 long_start 0 city_end 0 lat_end 0 long_end 0 dtype: int64
trip_customer_df.loc[:,["lat_start","long_start","lat_end", "long_end"]] = trip_customer_df.loc[:,["lat_start","long_start", "lat_end", "long_end"]].apply(pd.to_numeric, downcast="float")
trip_customer_df.dtypes
id int32 duration int32 start_date datetime64[ns] start_station_name object start_station_id int8 end_date datetime64[ns] end_station_name object end_station_id int8 bike_id int16 subscription_type category zip_code object city_start object lat_start float32 long_start float32 city_end object lat_end float32 long_end float32 dtype: object
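How downcast shrinks a dtype, in a minimal sketch: pd.to_numeric picks the smallest integer type that can hold all values, which is why the station-ID columns end up as int8 above.

```python
import pandas as pd

s = pd.Series([2, 70, 84], dtype="int64")     # station-ID-sized values
small = pd.to_numeric(s, downcast="integer")  # smallest fitting signed type
print(small.dtype)  # int8, since all values fit in [-128, 127]
```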
trip_customer_df['Year_start'] = trip_customer_df.start_date.dt.year
trip_customer_df['Month_start'] = trip_customer_df.start_date.dt.month
trip_customer_df['Day_start'] = trip_customer_df.start_date.dt.day
trip_customer_df['Day_of_week_start'] = trip_customer_df.start_date.dt.day_name()
trip_customer_df['Day_of_week_num_start'] = trip_customer_df.start_date.dt.day_of_week #The day of the week with Monday=0, Sunday=6.
trip_customer_df['Time_start'] = trip_customer_df.start_date.dt.time
#trip_customer_df['rental_duration_sec'] = (trip_customer_df.loc[:, 'end_date'] - trip_customer_df.loc[:, 'start_date']).dt.total_seconds()
trip_customer_df['rental_duration'] = trip_customer_df.loc[:, 'end_date'] - trip_customer_df.loc[:, 'start_date']
# trip_customer_df.drop(columns=['Day_end'], inplace=True)
# trip_customer_df.drop(columns=['rental_duration_sec'], inplace=True)
trip_customer_df.head()
| id | duration | start_date | start_station_name | start_station_id | end_date | end_station_name | end_station_id | bike_id | subscription_type | ... | city_end | lat_end | long_end | Year_start | Month_start | Day_start | Day_of_week_start | Day_of_week_num_start | Time_start | rental_duration | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4069 | 174 | 2013-08-29 09:08:00 | 2nd at South Park | 64 | 2013-08-29 09:11:00 | 2nd at South Park | 64 | 288 | Subscriber | ... | San Francisco | 37.782261 | -122.392738 | 2013 | 8 | 29 | Thursday | 3 | 09:08:00 | 0 days 00:03:00 |
| 1 | 4426 | 982 | 2013-08-29 12:52:00 | 2nd at South Park | 64 | 2013-08-29 13:08:00 | 2nd at South Park | 64 | 593 | Subscriber | ... | San Francisco | 37.782261 | -122.392738 | 2013 | 8 | 29 | Thursday | 3 | 12:52:00 | 0 days 00:16:00 |
| 2 | 4557 | 130 | 2013-08-29 13:57:00 | 2nd at South Park | 64 | 2013-08-29 13:59:00 | 2nd at South Park | 64 | 371 | Subscriber | ... | San Francisco | 37.782261 | -122.392738 | 2013 | 8 | 29 | Thursday | 3 | 13:57:00 | 0 days 00:02:00 |
| 3 | 4564 | 1693 | 2013-08-29 14:04:00 | 2nd at South Park | 64 | 2013-08-29 14:32:00 | 2nd at South Park | 64 | 272 | Subscriber | ... | San Francisco | 37.782261 | -122.392738 | 2013 | 8 | 29 | Thursday | 3 | 14:04:00 | 0 days 00:28:00 |
| 4 | 4566 | 1699 | 2013-08-29 14:04:00 | 2nd at South Park | 64 | 2013-08-29 14:33:00 | 2nd at South Park | 64 | 371 | Subscriber | ... | San Francisco | 37.782261 | -122.392738 | 2013 | 8 | 29 | Thursday | 3 | 14:04:00 | 0 days 00:29:00 |
5 rows × 24 columns
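Subtracting the two datetime columns yields pandas Timedelta values; the commented-out total_seconds() variant would give plain seconds instead. A minimal check with the first trip's timestamps:

```python
import pandas as pd

start = pd.Timestamp("2013-08-29 09:08:00")
end = pd.Timestamp("2013-08-29 09:11:00")
delta = end - start
# 180.0 seconds, vs. the recorded 'duration' of 174 s for this trip,
# because the text timestamps are only minute-resolution
print(delta, delta.total_seconds())
```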
trip_customer_df.loc[trip_customer_df['duration'] == trip_customer_df['duration'].max(), :]
| id | duration | start_date | start_station_name | start_station_id | end_date | end_station_name | end_station_id | bike_id | subscription_type | ... | city_end | lat_end | long_end | Year_start | Month_start | Day_start | Day_of_week_start | Day_of_week_num_start | Time_start | rental_duration | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 135291 | 568474 | 17270400 | 2014-12-06 21:59:00 | South Van Ness at Market | 66 | 2015-06-24 20:18:00 | 2nd at Folsom | 62 | 535 | Customer | ... | San Francisco | 37.785297 | -122.396233 | 2014 | 12 | 6 | Saturday | 5 | 21:59:00 | 199 days 22:19:00 |
1 rows × 24 columns
# bikes that were rented for more than one day
trip_customer_df.loc[trip_customer_df['duration'] > 60 * 60 * 24, :].sort_values('duration')
| id | duration | start_date | start_station_name | start_station_id | end_date | end_station_name | end_station_id | bike_id | subscription_type | ... | city_end | lat_end | long_end | Year_start | Month_start | Day_start | Day_of_week_start | Day_of_week_num_start | Time_start | rental_duration | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 627343 | 897491 | 86553 | 2015-08-19 19:58:00 | Castro Street and El Camino Real | 32 | 2015-08-20 20:01:00 | Evelyn Park and Ride | 30 | 120 | Customer | ... | Mountain View | 37.390278 | -122.066551 | 2015 | 8 | 19 | Wednesday | 2 | 19:58:00 | 1 days 00:03:00 |
| 611014 | 709268 | 86573 | 2015-04-02 10:18:00 | Cowper at University | 37 | 2015-04-03 10:20:00 | Cowper at University | 37 | 188 | Customer | ... | Palo Alto | 37.448597 | -122.159508 | 2015 | 4 | 2 | Thursday | 3 | 10:18:00 | 1 days 00:02:00 |
| 645219 | 84179 | 86609 | 2013-11-05 12:24:00 | San Salvador at 1st | 8 | 2013-11-06 12:28:00 | San Salvador at 1st | 8 | 194 | Customer | ... | San Jose | 37.330166 | -121.885834 | 2013 | 11 | 5 | Tuesday | 1 | 12:24:00 | 1 days 00:04:00 |
| 608480 | 222321 | 86668 | 2014-03-21 08:22:00 | San Antonio Caltrain Station | 29 | 2014-03-22 08:27:00 | San Antonio Caltrain Station | 29 | 662 | Customer | ... | Mountain View | 37.406940 | -122.106758 | 2014 | 3 | 21 | Friday | 4 | 08:22:00 | 1 days 00:05:00 |
| 665985 | 48791 | 86890 | 2013-10-07 12:29:00 | San Jose Civic Center | 3 | 2013-10-08 12:38:00 | San Jose Civic Center | 3 | 118 | Customer | ... | San Jose | 37.330696 | -121.888977 | 2013 | 10 | 7 | Monday | 0 | 12:29:00 | 1 days 00:09:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 612403 | 111309 | 722236 | 2013-11-30 13:29:00 | University and Emerson | 35 | 2013-12-08 22:06:00 | University and Emerson | 35 | 247 | Customer | ... | Palo Alto | 37.444523 | -122.163094 | 2013 | 11 | 30 | Saturday | 5 | 13:29:00 | 8 days 08:37:00 |
| 613014 | 841176 | 1133540 | 2015-07-10 10:35:00 | University and Emerson | 35 | 2015-07-23 13:27:00 | University and Emerson | 35 | 262 | Customer | ... | Palo Alto | 37.444523 | -122.163094 | 2015 | 7 | 10 | Friday | 4 | 10:35:00 | 13 days 02:52:00 |
| 630246 | 750192 | 1852590 | 2015-05-02 06:17:00 | San Antonio Shopping Center | 31 | 2015-05-23 16:53:00 | Castro Street and El Camino Real | 32 | 680 | Subscriber | ... | Mountain View | 37.385956 | -122.083679 | 2015 | 5 | 2 | Saturday | 5 | 06:17:00 | 21 days 10:36:00 |
| 416121 | 825850 | 2137000 | 2015-06-28 21:50:00 | Market at Sansome | 77 | 2015-07-23 15:27:00 | Yerba Buena Center of the Arts (3rd @ Howard) | 68 | 466 | Customer | ... | San Francisco | 37.784878 | -122.401016 | 2015 | 6 | 28 | Sunday | 6 | 21:50:00 | 24 days 17:37:00 |
| 135291 | 568474 | 17270400 | 2014-12-06 21:59:00 | South Van Ness at Market | 66 | 2015-06-24 20:18:00 | 2nd at Folsom | 62 | 535 | Customer | ... | San Francisco | 37.785297 | -122.396233 | 2014 | 12 | 6 | Saturday | 5 | 21:59:00 | 199 days 22:19:00 |
296 rows × 24 columns
# kategorien = ['Mon.','Tue.','Wed.', 'Thu.', 'Fri.', 'Sat.', 'Sun.']
# trip_customer_df["Day_of_week_start"] = pd.cut(trip_customer_df.loc[:, "Day_of_week_start"], bins = [-1, 0, 1, 2, 3,4,5,6] , labels = kategorien)
# trip_customer_df.head()
trip_customer_df.loc[:, 'subscription_type'].value_counts()
Subscriber    566746
Customer      103213
Name: subscription_type, dtype: int64
trip_customer_df.shape
(669959, 24)
trip_customer_df.to_pickle('trip_customer_df_after_preprocessing.pkl')
station_info_df.to_pickle('station_info_df.pkl')
trip_customer_df.duration.describe()
count    6.699590e+05
mean     1.107950e+03
std      2.225544e+04
min      6.000000e+01
25%      3.440000e+02
50%      5.170000e+02
75%      7.550000e+02
max      1.727040e+07
Name: duration, dtype: float64
no_bikes_or_docks_df.head()
| station_id | bikes_available | docks_available | time | |
|---|---|---|---|---|
| 0 | 2 | 0 | 27 | 2013/11/03 02:00:01 |
| 1 | 2 | 0 | 27 | 2013/11/03 02:01:01 |
| 2 | 2 | 0 | 27 | 2013/11/03 02:02:01 |
| 3 | 2 | 0 | 27 | 2013/11/03 02:03:01 |
| 4 | 2 | 0 | 27 | 2013/12/04 10:02:03 |
no_bikes_or_docks_df.loc[:, 'time'] = pd.to_datetime(no_bikes_or_docks_df.loc[:, 'time'])
no_bikes_or_docks_df.dtypes
station_id                  int64
bikes_available             int64
docks_available             int64
time               datetime64[ns]
dtype: object
no_bikes_df = no_bikes_or_docks_df.loc[no_bikes_or_docks_df.bikes_available == 0, :].groupby('station_id', as_index=False).size()
no_docks_df = no_bikes_or_docks_df.loc[no_bikes_or_docks_df.docks_available == 0, :].groupby('station_id', as_index=False).size()
station_problem_df = no_bikes_df.merge(no_docks_df, left_on='station_id', right_on='station_id', how='left')
station_problem_df.rename(columns={'size_x' : 'no_bikes_min', 'size_y' : 'no_docks_min'}, inplace=True)
station_problem_df.replace({"no_docks_min": np.nan}, 0, inplace=True)
station_problem_df.loc[:, ["no_bikes_min","no_docks_min","station_id"]] = station_problem_df.loc[:,
["no_bikes_min","no_docks_min","station_id"]].apply(pd.to_numeric, downcast="integer")
station_problem_df
| station_id | no_bikes_min | no_docks_min | |
|---|---|---|---|
| 0 | 2 | 1340 | 466 |
| 1 | 3 | 1066 | 4817 |
| 2 | 4 | 10618 | 6614 |
| 3 | 5 | 1321 | 0 |
| 4 | 6 | 3564 | 3633 |
| ... | ... | ... | ... |
| 65 | 77 | 4436 | 3129 |
| 66 | 80 | 2105 | 5 |
| 67 | 82 | 17824 | 3738 |
| 68 | 83 | 808 | 0 |
| 69 | 84 | 3642 | 1744 |
70 rows × 3 columns
station_problem_df.dtypes
station_id       int8
no_bikes_min    int16
no_docks_min    int16
dtype: object
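The pipeline above (count stockout minutes per station, left-merge, fill the stations that never ran out of docks, downcast to small integer types) can be sketched on a tiny, hypothetical status table:

```python
import pandas as pd

# Toy status data: one row per observed minute and station (hypothetical values)
status = pd.DataFrame({
    "station_id":      [2, 2, 3, 5, 2],
    "bikes_available": [0, 0, 0, 0, 5],
    "docks_available": [0, 4, 0, 3, 2],
})

# Minutes with no bikes / no docks, counted per station
no_bikes = status.loc[status.bikes_available == 0].groupby("station_id", as_index=False).size()
no_docks = status.loc[status.docks_available == 0].groupby("station_id", as_index=False).size()

# Left merge keeps stations that never ran out of docks (their count becomes NaN)
problems = no_bikes.merge(no_docks, on="station_id", how="left")
problems = problems.rename(columns={"size_x": "no_bikes_min", "size_y": "no_docks_min"})
problems["no_docks_min"] = problems["no_docks_min"].fillna(0)

# Downcast every column to the smallest integer type that fits
problems = problems.apply(pd.to_numeric, downcast="integer")
```

Station 5 never has zero docks, so its `no_docks_min` is filled with 0, mirroring the `replace`/`fillna` step above.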
critical_bikes_or_docks_df.head()
| station_id | bikes_available | docks_available | time | |
|---|---|---|---|---|
| 0 | 2 | 1 | 26 | 2013/08/29 20:38:01 |
| 1 | 2 | 1 | 26 | 2013/08/29 20:39:01 |
| 2 | 2 | 1 | 26 | 2013/08/29 20:40:02 |
| 3 | 2 | 1 | 26 | 2013/08/29 20:41:01 |
| 4 | 2 | 1 | 26 | 2013/08/29 20:42:01 |
critical_bikes_or_docks_df.loc[:, 'time'] = pd.to_datetime(critical_bikes_or_docks_df.loc[:, 'time'])
critical_bikes_or_docks_df.dtypes
station_id                  int64
bikes_available             int64
docks_available             int64
time               datetime64[ns]
dtype: object
df_sf_16_april_2014.head()
| station_id | bikes_available | docks_available | time | |
|---|---|---|---|---|
| 0 | 39 | 14 | 5 | 2014-04-16 00:00:03 |
| 1 | 39 | 14 | 5 | 2014-04-16 00:01:03 |
| 2 | 39 | 14 | 5 | 2014-04-16 00:02:03 |
| 3 | 39 | 14 | 5 | 2014-04-16 00:03:02 |
| 4 | 39 | 14 | 5 | 2014-04-16 00:04:02 |
df_sf_16_april_2014.dtypes
station_id          int64
bikes_available     int64
docks_available     int64
time               object
dtype: object
df_sf_16_april_2014.loc[:, 'time'] = pd.to_datetime(df_sf_16_april_2014.loc[:, 'time'])
df_sf_16_april_2014.loc[:,
["station_id","bikes_available","docks_available"]] = df_sf_16_april_2014.loc[:,
["station_id","bikes_available","docks_available"]].apply(pd.to_numeric, downcast="integer")
df_sf_16_april_2014.dtypes
station_id                  int8
bikes_available             int8
docks_available             int8
time              datetime64[ns]
dtype: object
df_sf_16_april_2014 = df_sf_16_april_2014.merge(station_info_df.loc[:, ['id', 'lat', 'long']], left_on='station_id', right_on='id')
df_sf_16_april_2014.drop(columns=['id'], inplace=True)
df_sf_16_april_2014.head()
| station_id | bikes_available | docks_available | time | lat | long | |
|---|---|---|---|---|---|---|
| 0 | 39 | 14 | 5 | 2014-04-16 00:00:03 | 37.783871 | -122.408433 |
| 1 | 39 | 14 | 5 | 2014-04-16 00:01:03 | 37.783871 | -122.408433 |
| 2 | 39 | 14 | 5 | 2014-04-16 00:02:03 | 37.783871 | -122.408433 |
| 3 | 39 | 14 | 5 | 2014-04-16 00:03:02 | 37.783871 | -122.408433 |
| 4 | 39 | 14 | 5 | 2014-04-16 00:04:02 | 37.783871 | -122.408433 |
df_sf_16_april_2014["time_of_day"] = df_sf_16_april_2014.time.dt.time
df_sf_16_april_2014["hour"] = df_sf_16_april_2014.time.dt.hour
df_sf_16_april_2014.head()
| station_id | bikes_available | docks_available | time | lat | long | time_of_day | hour | |
|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 14 | 5 | 2014-04-16 00:00:03 | 37.783871 | -122.408433 | 00:00:03 | 0 |
| 1 | 39 | 14 | 5 | 2014-04-16 00:01:03 | 37.783871 | -122.408433 | 00:01:03 | 0 |
| 2 | 39 | 14 | 5 | 2014-04-16 00:02:03 | 37.783871 | -122.408433 | 00:02:03 | 0 |
| 3 | 39 | 14 | 5 | 2014-04-16 00:03:02 | 37.783871 | -122.408433 | 00:03:02 | 0 |
| 4 | 39 | 14 | 5 | 2014-04-16 00:04:02 | 37.783871 | -122.408433 | 00:04:02 | 0 |
import plotly.express as px
import plotly.graph_objects as go
city_station_df = station_info_df.groupby(['city'], as_index=False).size().sort_values(by='size')
city_station_df
| city | size | |
|---|---|---|
| 1 | Palo Alto | 5 |
| 0 | Mountain View | 7 |
| 2 | Redwood City | 7 |
| 4 | San Jose | 16 |
| 3 | San Francisco | 35 |
fig = px.pie(city_station_df, values = 'size', names='city', title= 'Anzahl der Mietstationen in jeder Stadt')
fig.update_traces(textposition='inside', textinfo='percent+value')
fig.update_traces(texttemplate='%{percent}'+' '+'(%{value} St.)', textposition='inside')
fig.show()
-- There are bike rental stations in five Californian cities
fig = px.scatter_mapbox(station_info_df,
lat="lat", lon="long",
color='dock_count',
mapbox_style="carto-positron",
size='dock_count',
size_max=15,
hover_data =["city","name","installation_date"],
title='Fahrradverleihstationen auf der Karte')
fig.show()
top_10_stationen_df = trip_customer_df.groupby(["start_station_id"], as_index=False).size().sort_values("size", ascending=False).iloc[0:10, :]
top_10_stationen_df = top_10_stationen_df.merge(station_info_df, left_on='start_station_id', right_on='id')
top_10_stationen_df.head()
| start_station_id | size | id | name | lat | long | dock_count | city | installation_date | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 70 | 49092 | 70 | San Francisco Caltrain (Townsend at 4th) | 37.776617 | -122.395260 | 19 | San Francisco | 8/23/2013 |
| 1 | 69 | 33742 | 69 | San Francisco Caltrain 2 (330 Townsend) | 37.776600 | -122.395470 | 23 | San Francisco | 8/23/2013 |
| 2 | 50 | 32934 | 50 | Harry Bridges Plaza (Ferry Building) | 37.795392 | -122.394203 | 23 | San Francisco | 8/20/2013 |
| 3 | 60 | 27713 | 60 | Embarcadero at Sansome | 37.804770 | -122.403234 | 15 | San Francisco | 8/21/2013 |
| 4 | 55 | 26089 | 55 | Temporary Transbay Terminal (Howard at Beale) | 37.789756 | -122.394643 | 23 | San Francisco | 8/20/2013 |
fig = px.bar(top_10_stationen_df.sort_values("size"), x="size", y="name", orientation='h', color = "city",
labels={'size':'Anzahl Fahrten', 'name':'Stationsname', 'city':'Stadt', 'id':'ID'},
hover_data=['city', 'size','name', 'id'])
fig.show()
station_info_df.loc[:,'installation_date'].unique()
array(['8/6/2013', '8/5/2013', '8/7/2013', '8/12/2013', '8/15/2013',
'8/16/2013', '12/31/2013', '8/14/2013', '8/25/2013', '8/19/2013',
'8/20/2013', '8/21/2013', '8/22/2013', '8/23/2013', '1/22/2014',
'2/20/2014', '4/9/2014'], dtype=object)
station_info_df.loc[:,'installation_date'] = pd.to_datetime(station_info_df.loc[:,'installation_date'], format='%m/%d/%Y')
pd.to_datetime('8/6/2013', format='%m/%d/%Y')
Timestamp('2013-08-06 00:00:00')
tmp_df = station_info_df.groupby(['city', 'installation_date'], as_index=False).size()
city_inst_date_df = tmp_df.groupby(['city', 'installation_date']).sum().groupby(level=0).cumsum().reset_index().sort_values(by='installation_date')
city_inst_date_df.head()
| city | installation_date | size | |
|---|---|---|---|
| 15 | San Jose | 2013-08-05 | 5 |
| 16 | San Jose | 2013-08-06 | 10 |
| 17 | San Jose | 2013-08-07 | 14 |
| 5 | Redwood City | 2013-08-12 | 4 |
| 3 | Palo Alto | 2013-08-14 | 4 |
fig = px.line(city_inst_date_df, x="installation_date", y="size", color="city", markers=True,
title='Daten der Installation der Stationen',
                labels={'size' : 'Gesamtzahl der Stationen in der Stadt', 'city':'Stadt', 'installation_date':'Datum der Installation'})
#fig.update_traces(textposition="bottom right")
fig.show()
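The cumulative station count per city plotted above comes from a group-wise running sum; a minimal sketch with hypothetical installation dates:

```python
import pandas as pd

# Toy station list: installation dates per city (hypothetical values)
stations = pd.DataFrame({
    "city": ["San Jose", "San Jose", "San Jose", "Palo Alto", "Palo Alto"],
    "installation_date": pd.to_datetime(
        ["2013-08-05", "2013-08-05", "2013-08-06", "2013-08-14", "2013-08-14"]),
})

# Stations installed per city and day ...
per_day = stations.groupby(["city", "installation_date"], as_index=False).size()
# ... then the running total within each city
per_day["size"] = per_day.groupby("city")["size"].cumsum()
cum = per_day.sort_values("installation_date")
```

Sorting by date at the end gives the time axis plotly needs for the line chart.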
df_subscription = trip_customer_df.groupby(['subscription_type'], as_index=False).size()
df_subscription
| subscription_type | size | |
|---|---|---|
| 0 | Customer | 103213 |
| 1 | Subscriber | 566746 |
fig = px.bar(data_frame=df_subscription,x='subscription_type',y='size', text = 'size', title="The total number of trips depending on the type of subscription",
labels={'size':'Anzahl der Fahrten', 'subscription_type':'Art des Abonnements'})
fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.update_traces(showlegend=False)
# fig.update(layout_coloraxis_showscale=False)
---> Most users have a bike rental subscription (Subscriber)
df_days_trip = trip_customer_df.groupby(['Day_of_week_start', 'Day_of_week_num_start'], as_index=False).size().sort_values(by='Day_of_week_num_start')
df_days_trip
| Day_of_week_start | Day_of_week_num_start | size | |
|---|---|---|---|
| 1 | Monday | 0 | 115873 |
| 5 | Tuesday | 1 | 122259 |
| 6 | Wednesday | 2 | 120201 |
| 4 | Thursday | 3 | 119089 |
| 0 | Friday | 4 | 109361 |
| 2 | Saturday | 5 | 44785 |
| 3 | Sunday | 6 | 38391 |
fig = px.bar(data_frame=df_days_trip,x='Day_of_week_start',y='size', text = 'size', title="Nutzung über die Wochentage",
labels={'size':'Anzahl der Fahrten', 'Day_of_week_start':'Wochentag'})
fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.update_traces(showlegend=False)
df_days_trip = trip_customer_df.groupby(
['Day_of_week_start', 'Day_of_week_num_start', 'subscription_type'], as_index=False, observed=True
).size().sort_values(by=['Day_of_week_num_start', 'subscription_type'])
df_days_trip
| Day_of_week_start | Day_of_week_num_start | subscription_type | size | |
|---|---|---|---|---|
| 3 | Monday | 0 | Customer | 11469 |
| 2 | Monday | 0 | Subscriber | 104404 |
| 11 | Tuesday | 1 | Customer | 11040 |
| 10 | Tuesday | 1 | Subscriber | 111219 |
| 13 | Wednesday | 2 | Customer | 11495 |
| 12 | Wednesday | 2 | Subscriber | 108706 |
| 9 | Thursday | 3 | Customer | 12451 |
| 8 | Thursday | 3 | Subscriber | 106638 |
| 1 | Friday | 4 | Customer | 14946 |
| 0 | Friday | 4 | Subscriber | 94415 |
| 5 | Saturday | 5 | Customer | 22125 |
| 4 | Saturday | 5 | Subscriber | 22660 |
| 7 | Sunday | 6 | Customer | 19687 |
| 6 | Sunday | 6 | Subscriber | 18704 |
fig = px.bar(df_days_trip, x='Day_of_week_start', y='size', title="Gesamtzahl der Fahrten je nach Wochentag und Abonnementtyp",
text = 'size', color = 'subscription_type',
labels={'size':'Anzahl der Fahrten', 'Day_of_week_start':'Wochentag', 'subscription_type' : 'Abonnementstyp'})
fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.show()
fig = px.line(df_days_trip, x="Day_of_week_start", y="size",
title='Gesamtzahl der Fahrten je nach Wochentag und Abonnementtyp',
color = 'subscription_type',
labels={'size':'Anzahl der Fahrten', 'Day_of_week_start':'Wochentag', 'subscription_type' : 'Abonnementstyp'})
fig.update_traces(mode="markers+lines")
fig.show()
df_city_trip = trip_customer_df.groupby(['city_start'], as_index=False).size().sort_values(by='size')
df_city_trip
| city_start | size | |
|---|---|---|
| 2 | Redwood City | 3433 |
| 1 | Palo Alto | 6773 |
| 0 | Mountain View | 18167 |
| 4 | San Jose | 37878 |
| 3 | San Francisco | 603708 |
fig = px.pie(df_city_trip, values = 'size', names='city_start', title= 'Anzahl der Fahrten in jeder Stadt')
fig.show()
---> About 90% of trips take place in San Francisco
trip_customer_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 669959 entries, 0 to 669958
Data columns (total 24 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   id                     669959 non-null  int32
 1   duration               669959 non-null  int32
 2   start_date             669959 non-null  datetime64[ns]
 3   start_station_name     669959 non-null  object
 4   start_station_id       669959 non-null  int8
 5   end_date               669959 non-null  datetime64[ns]
 6   end_station_name       669959 non-null  object
 7   end_station_id         669959 non-null  int8
 8   bike_id                669959 non-null  int16
 9   subscription_type      669959 non-null  category
 10  zip_code               663340 non-null  object
 11  city_start             669959 non-null  object
 12  lat_start              669959 non-null  float32
 13  long_start             669959 non-null  float32
 14  city_end               669959 non-null  object
 15  lat_end                669959 non-null  float32
 16  long_end               669959 non-null  float32
 17  Year_start             669959 non-null  int64
 18  Month_start            669959 non-null  int64
 19  Day_start              669959 non-null  int64
 20  Day_of_week_start      669959 non-null  object
 21  Day_of_week_num_start  669959 non-null  int64
 22  Time_start             669959 non-null  object
 23  rental_duration        669959 non-null  timedelta64[ns]
dtypes: category(1), datetime64[ns](2), float32(4), int16(1), int32(2), int64(4), int8(2), object(7), timedelta64[ns](1)
memory usage: 95.2+ MB
trip_customer_df.head()
| id | duration | start_date | start_station_name | start_station_id | end_date | end_station_name | end_station_id | bike_id | subscription_type | ... | city_end | lat_end | long_end | Year_start | Month_start | Day_start | Day_of_week_start | Day_of_week_num_start | Time_start | rental_duration | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4069 | 174 | 2013-08-29 09:08:00 | 2nd at South Park | 64 | 2013-08-29 09:11:00 | 2nd at South Park | 64 | 288 | Subscriber | ... | San Francisco | 37.782261 | -122.392738 | 2013 | 8 | 29 | Thursday | 3 | 09:08:00 | 0 days 00:03:00 |
| 1 | 4426 | 982 | 2013-08-29 12:52:00 | 2nd at South Park | 64 | 2013-08-29 13:08:00 | 2nd at South Park | 64 | 593 | Subscriber | ... | San Francisco | 37.782261 | -122.392738 | 2013 | 8 | 29 | Thursday | 3 | 12:52:00 | 0 days 00:16:00 |
| 2 | 4557 | 130 | 2013-08-29 13:57:00 | 2nd at South Park | 64 | 2013-08-29 13:59:00 | 2nd at South Park | 64 | 371 | Subscriber | ... | San Francisco | 37.782261 | -122.392738 | 2013 | 8 | 29 | Thursday | 3 | 13:57:00 | 0 days 00:02:00 |
| 3 | 4564 | 1693 | 2013-08-29 14:04:00 | 2nd at South Park | 64 | 2013-08-29 14:32:00 | 2nd at South Park | 64 | 272 | Subscriber | ... | San Francisco | 37.782261 | -122.392738 | 2013 | 8 | 29 | Thursday | 3 | 14:04:00 | 0 days 00:28:00 |
| 4 | 4566 | 1699 | 2013-08-29 14:04:00 | 2nd at South Park | 64 | 2013-08-29 14:33:00 | 2nd at South Park | 64 | 371 | Subscriber | ... | San Francisco | 37.782261 | -122.392738 | 2013 | 8 | 29 | Thursday | 3 | 14:04:00 | 0 days 00:29:00 |
5 rows × 24 columns
trip_customer_df['Hour_start'] = trip_customer_df.start_date.dt.hour
trip_customer_df.head()
| id | duration | start_date | start_station_name | start_station_id | end_date | end_station_name | end_station_id | bike_id | subscription_type | ... | lat_end | long_end | Year_start | Month_start | Day_start | Day_of_week_start | Day_of_week_num_start | Time_start | rental_duration | Hour_start | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4069 | 174 | 2013-08-29 09:08:00 | 2nd at South Park | 64 | 2013-08-29 09:11:00 | 2nd at South Park | 64 | 288 | Subscriber | ... | 37.782261 | -122.392738 | 2013 | 8 | 29 | Thursday | 3 | 09:08:00 | 0 days 00:03:00 | 9 |
| 1 | 4426 | 982 | 2013-08-29 12:52:00 | 2nd at South Park | 64 | 2013-08-29 13:08:00 | 2nd at South Park | 64 | 593 | Subscriber | ... | 37.782261 | -122.392738 | 2013 | 8 | 29 | Thursday | 3 | 12:52:00 | 0 days 00:16:00 | 12 |
| 2 | 4557 | 130 | 2013-08-29 13:57:00 | 2nd at South Park | 64 | 2013-08-29 13:59:00 | 2nd at South Park | 64 | 371 | Subscriber | ... | 37.782261 | -122.392738 | 2013 | 8 | 29 | Thursday | 3 | 13:57:00 | 0 days 00:02:00 | 13 |
| 3 | 4564 | 1693 | 2013-08-29 14:04:00 | 2nd at South Park | 64 | 2013-08-29 14:32:00 | 2nd at South Park | 64 | 272 | Subscriber | ... | 37.782261 | -122.392738 | 2013 | 8 | 29 | Thursday | 3 | 14:04:00 | 0 days 00:28:00 | 14 |
| 4 | 4566 | 1699 | 2013-08-29 14:04:00 | 2nd at South Park | 64 | 2013-08-29 14:33:00 | 2nd at South Park | 64 | 371 | Subscriber | ... | 37.782261 | -122.392738 | 2013 | 8 | 29 | Thursday | 3 | 14:04:00 | 0 days 00:29:00 | 14 |
5 rows × 25 columns
df_days_hours = trip_customer_df.groupby(['Day_of_week_start', 'Day_of_week_num_start', 'Hour_start'], as_index=False).size().sort_values(by='Day_of_week_num_start')
df_days_hours
| Day_of_week_start | Day_of_week_num_start | Hour_start | size | |
|---|---|---|---|---|
| 47 | Monday | 0 | 23 | 515 |
| 27 | Monday | 0 | 3 | 49 |
| 28 | Monday | 0 | 4 | 182 |
| 29 | Monday | 0 | 5 | 593 |
| 30 | Monday | 0 | 6 | 2739 |
| ... | ... | ... | ... | ... |
| 75 | Sunday | 6 | 3 | 82 |
| 74 | Sunday | 6 | 2 | 178 |
| 73 | Sunday | 6 | 1 | 291 |
| 82 | Sunday | 6 | 10 | 2728 |
| 83 | Sunday | 6 | 11 | 3402 |
168 rows × 4 columns
fig = px.scatter(df_days_hours, x="Day_of_week_start", y="Hour_start", size='size', size_max = 25, #text = 'size',
labels={'Hour_start':'Uhrzeit der Fahrradabholung', 'Day_of_week_start':'Wochentag'},
title = "Zu welchen Zeiten der Dienst am meisten genutzt wird")
# fig.update_traces(texttemplate='%{text}', textposition='middle right')
fig.show()
The chart shows that on weekdays the service is used most in the morning and evening, while on weekends usage peaks in the middle of the day.
---> Let's plot the average weekday and weekend curves.
df_days_hours.sort_values("Hour_start", inplace=True)
df_days_hours["Hour_start_str"] = df_days_hours.Hour_start.apply(lambda x: f"{x} Uhr")
df_days_hours["day_type"] = df_days_hours.Day_of_week_num_start.apply(lambda x: "Working" if x < 5 else "Weekend")
df_days_hours.head()
| Day_of_week_start | Day_of_week_num_start | Hour_start | size | Hour_start_str | day_type | |
|---|---|---|---|---|---|---|
| 72 | Sunday | 6 | 0 | 455 | 0 Uhr | Weekend |
| 24 | Monday | 0 | 0 | 180 | 0 Uhr | Working |
| 0 | Friday | 4 | 0 | 302 | 0 Uhr | Working |
| 96 | Thursday | 3 | 0 | 277 | 0 Uhr | Working |
| 144 | Wednesday | 2 | 0 | 253 | 0 Uhr | Working |
df_day_type = df_days_hours.groupby(["day_type", "Hour_start", "Hour_start_str"], as_index=False).agg(size=('size', 'mean'))
df_day_type.sort_values("Hour_start", inplace=True)
df_day_type.head()
| day_type | Hour_start | Hour_start_str | size | |
|---|---|---|---|---|
| 0 | Weekend | 0 | 0 Uhr | 469.0 |
| 24 | Working | 0 | 0 Uhr | 246.6 |
| 25 | Working | 1 | 1 Uhr | 116.6 |
| 1 | Weekend | 1 | 1 Uhr | 303.0 |
| 2 | Weekend | 2 | 2 Uhr | 177.0 |
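The weekday/weekend classification and the per-hour averaging above can be sketched on hypothetical toy counts:

```python
import pandas as pd

# Toy hourly trip counts with a weekday number (0=Mon ... 6=Sun), hypothetical values
hourly = pd.DataFrame({
    "Day_of_week_num_start": [0, 1, 5, 6, 0, 5],
    "Hour_start":            [8, 8, 8, 8, 9, 9],
    "size":                  [100, 120, 40, 60, 90, 30],
})

# Classify working days vs. weekend, then average trips per hour within each class
hourly["day_type"] = hourly["Day_of_week_num_start"].apply(
    lambda x: "Working" if x < 5 else "Weekend")
avg = hourly.groupby(["day_type", "Hour_start"], as_index=False).agg(size=("size", "mean"))
```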
fig = px.line_polar(df_day_type, r="size", theta="Hour_start_str", color="day_type", line_close=True,
color_discrete_sequence=px.colors.qualitative.Plotly,
labels={'day_type':'Tag'})
fig.show()
trip_customer_df['duration_h'] = trip_customer_df['duration'] / 3600
trip_customer_df["duration_min"] = trip_customer_df["duration"] / 60
customer_trips_df = trip_customer_df.loc[trip_customer_df.loc[:, 'subscription_type'] == 'Customer', :]
subscriber_trips_df = trip_customer_df.loc[trip_customer_df.loc[:, 'subscription_type'] == 'Subscriber', :]
df_days_hours_customer = customer_trips_df.groupby(
['Day_of_week_start', 'Day_of_week_num_start', 'Hour_start'],
as_index=False).size().sort_values(by='Day_of_week_num_start')
df_days_hours_subscriber = subscriber_trips_df.groupby(
['Day_of_week_start', 'Day_of_week_num_start', 'Hour_start'],
as_index=False).size().sort_values(by='Day_of_week_num_start')
fig = px.scatter(df_days_hours_customer, x="Day_of_week_start", y="Hour_start", size='size', size_max = 25, #text = 'size',
labels={'Hour_start':'Uhrzeit der Fahrradabholung', 'Day_of_week_start':'Wochentag'},
title = "Wie Customers den Service nutzen")
fig.show()
fig = px.scatter(df_days_hours_subscriber, x="Day_of_week_start", y="Hour_start", size='size', size_max = 25, #text = 'size',
labels={'Hour_start':'Uhrzeit der Fahrradabholung', 'Day_of_week_start':'Wochentag'},
title = "Wie Subscribers den Service nutzen")
fig.show()
# station_tripcount_duration = trip_customer_df.groupby(["bike_id"], as_index=False).agg(
# duration = ('duration_h', np.sum),
# count = ('bike_id','size')).sort_values(by='bike_id', ascending=True)
# station_tripcount_duration
# station_tripcount_duration.describe()
# fig = px.scatter(
# station_tripcount_duration.loc[station_tripcount_duration["duration"] < 1000, :],
# x="count",
# y="duration",
# title="Nutzungsdauer (h) gegen Anzahl Nutzungen",
# labels={'count':'Anzahl Fahrten', 'duration':'Gesamtreisedauer'},
# hover_data=['count', 'duration','bike_id'])
# fig.show()
trip_customer_df.loc[:, ['duration', 'duration_min', 'duration_h']].describe()
| duration | duration_min | duration_h | |
|---|---|---|---|
| count | 6.699590e+05 | 669959.000000 | 669959.000000 |
| mean | 1.107950e+03 | 18.465831 | 0.307764 |
| std | 2.225544e+04 | 370.923950 | 6.182066 |
| min | 6.000000e+01 | 1.000000 | 0.016667 |
| 25% | 3.440000e+02 | 5.733333 | 0.095556 |
| 50% | 5.170000e+02 | 8.616667 | 0.143611 |
| 75% | 7.550000e+02 | 12.583333 | 0.209722 |
| max | 1.727040e+07 | 287840.000000 | 4797.333333 |
max(trip_customer_df["rental_duration"])
Timedelta('199 days 22:19:00')
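The `duration` column is in raw seconds; converting it with `pd.to_timedelta` makes the outliers human-readable. A sketch using three durations that also appear above:

```python
import pandas as pd

# Durations in seconds: the minimum, a ~1-day rental, and the extreme outlier
seconds = pd.Series([60, 86553, 17270400])
as_td = pd.to_timedelta(seconds, unit="s")
# 17270400 s is roughly 199 days 21:20:00
```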
-- The minimum trip duration is one minute.
-- The median trip duration is about eight and a half minutes.
-- There are outliers: the maximum duration is 287,840 minutes, which is almost 200 days.
Therefore, we check the number of trips that last less than two minutes or more than one day.
len(trip_customer_df.loc[trip_customer_df['duration'] < 120, :]) / len(trip_customer_df) * 100
0.7934813921448924
!!! Almost one percent of all trips last less than two minutes.
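The percentage computed above divides a filtered row count by the total; the mean of a boolean mask does the same in one step. A sketch with hypothetical durations:

```python
import pandas as pd

# Toy trip durations in seconds (hypothetical values)
trips = pd.DataFrame({"duration": [60, 90, 300, 600, 90000, 100000]})

# Share of trips shorter than two minutes, in percent
# (mean of a boolean mask == fraction of True values)
short_share = (trips["duration"] < 120).mean() * 100

# Trips longer than one day
long_trips = trips.loc[trips["duration"] > 60 * 60 * 24]
```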
df_trips_longer24h = trip_customer_df.loc[
trip_customer_df['duration'] > 24*60*60 ,
['start_station_id', 'end_station_id', 'subscription_type', 'duration', 'duration_min', 'duration_h', 'rental_duration']
].sort_values("duration", ascending=False)
df_trips_longer24h
| start_station_id | end_station_id | subscription_type | duration | duration_min | duration_h | rental_duration | |
|---|---|---|---|---|---|---|---|
| 135291 | 66 | 62 | Customer | 17270400 | 287840.000000 | 4797.333333 | 199 days 22:19:00 |
| 416121 | 77 | 68 | Customer | 2137000 | 35616.666667 | 593.611111 | 24 days 17:37:00 |
| 630246 | 31 | 32 | Subscriber | 1852590 | 30876.500000 | 514.608333 | 21 days 10:36:00 |
| 613014 | 35 | 35 | Customer | 1133540 | 18892.333333 | 314.872222 | 13 days 02:52:00 |
| 612403 | 35 | 35 | Customer | 722236 | 12037.266667 | 200.621111 | 8 days 08:37:00 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 665985 | 3 | 3 | Customer | 86890 | 1448.166667 | 24.136111 | 1 days 00:09:00 |
| 608480 | 29 | 29 | Customer | 86668 | 1444.466667 | 24.074444 | 1 days 00:05:00 |
| 645219 | 8 | 8 | Customer | 86609 | 1443.483333 | 24.058056 | 1 days 00:04:00 |
| 611014 | 37 | 37 | Customer | 86573 | 1442.883333 | 24.048056 | 1 days 00:02:00 |
| 627343 | 32 | 30 | Customer | 86553 | 1442.550000 | 24.042500 | 1 days 00:03:00 |
296 rows × 7 columns
If people rent a bike for a long time and keep it with them, nobody else can use it.
--> So check whether this is related to the subscription type.
df_trips_longer24h["duration_d"] = df_trips_longer24h["duration_h"] / 24
df_trips_longer24h.groupby("subscription_type")["duration_d"].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| subscription_type | ||||||||
| Customer | 248.0 | 3.292132 | 12.735369 | 1.001771 | 1.179682 | 1.914531 | 2.853788 | 199.888889 |
| Subscriber | 48.0 | 2.548488 | 3.130647 | 1.012998 | 1.125489 | 1.897280 | 2.621849 | 21.442014 |
Customers behave noticeably differently from Subscribers.
fig = px.histogram(df_trips_longer24h.query("duration_d<10"),
x="duration_d",
nbins=50,
marginal="box",
color="subscription_type",
title='Histogramm für 1 < lange Ausleihzeiten < 10 Tage',
labels={'duration_d':'Leihzeitraum in Tagen', 'subscription_type': 'Kundentyp'})
fig.show()
df_trips_longer24h.query("duration_d>=10")
| start_station_id | end_station_id | subscription_type | duration | duration_min | duration_h | rental_duration | duration_d | |
|---|---|---|---|---|---|---|---|---|
| 135291 | 66 | 62 | Customer | 17270400 | 287840.000000 | 4797.333333 | 199 days 22:19:00 | 199.888889 |
| 416121 | 77 | 68 | Customer | 2137000 | 35616.666667 | 593.611111 | 24 days 17:37:00 | 24.733796 |
| 630246 | 31 | 32 | Subscriber | 1852590 | 30876.500000 | 514.608333 | 21 days 10:36:00 | 21.442014 |
| 613014 | 35 | 35 | Customer | 1133540 | 18892.333333 | 314.872222 | 13 days 02:52:00 | 13.119676 |
trip_customer_df.groupby("subscription_type")["duration_min"].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| subscription_type | ||||||||
| Customer | 103213.0 | 65.862689 | 933.941497 | 1.0 | 10.966667 | 18.600000 | 38.816667 | 287840.0 |
| Subscriber | 566746.0 | 9.834148 | 57.525599 | 1.0 | 5.383333 | 7.933333 | 11.100000 | 30876.5 |
-- Subscribers generally have a much shorter usage duration.
-- What about durations shorter than one minute?
-- Were the long rentals actually paid for? Should they be followed up on earlier?
-- Do some customers keep a bike permanently? (A weakness of the subscription model?)
round_trips_df = trip_customer_df.query("start_station_id == end_station_id")
round_trips_df.head()
| id | duration | start_date | start_station_name | start_station_id | end_date | end_station_name | end_station_id | bike_id | subscription_type | ... | Year_start | Month_start | Day_start | Day_of_week_start | Day_of_week_num_start | Time_start | rental_duration | Hour_start | duration_h | duration_min | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4069 | 174 | 2013-08-29 09:08:00 | 2nd at South Park | 64 | 2013-08-29 09:11:00 | 2nd at South Park | 64 | 288 | Subscriber | ... | 2013 | 8 | 29 | Thursday | 3 | 09:08:00 | 0 days 00:03:00 | 9 | 0.048333 | 2.900000 |
| 1 | 4426 | 982 | 2013-08-29 12:52:00 | 2nd at South Park | 64 | 2013-08-29 13:08:00 | 2nd at South Park | 64 | 593 | Subscriber | ... | 2013 | 8 | 29 | Thursday | 3 | 12:52:00 | 0 days 00:16:00 | 12 | 0.272778 | 16.366667 |
| 2 | 4557 | 130 | 2013-08-29 13:57:00 | 2nd at South Park | 64 | 2013-08-29 13:59:00 | 2nd at South Park | 64 | 371 | Subscriber | ... | 2013 | 8 | 29 | Thursday | 3 | 13:57:00 | 0 days 00:02:00 | 13 | 0.036111 | 2.166667 |
| 3 | 4564 | 1693 | 2013-08-29 14:04:00 | 2nd at South Park | 64 | 2013-08-29 14:32:00 | 2nd at South Park | 64 | 272 | Subscriber | ... | 2013 | 8 | 29 | Thursday | 3 | 14:04:00 | 0 days 00:28:00 | 14 | 0.470278 | 28.216667 |
| 4 | 4566 | 1699 | 2013-08-29 14:04:00 | 2nd at South Park | 64 | 2013-08-29 14:33:00 | 2nd at South Park | 64 | 371 | Subscriber | ... | 2013 | 8 | 29 | Thursday | 3 | 14:04:00 | 0 days 00:29:00 | 14 | 0.471944 | 28.316667 |
5 rows × 27 columns
round_trips_df.groupby(["subscription_type"])["duration_min"].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| subscription_type | ||||||||
| Customer | 16443.0 | 143.545195 | 371.252668 | 1.0 | 26.133333 | 67.666667 | 149.225000 | 18892.333333 |
| Subscriber | 7538.0 | 23.963072 | 121.404286 | 1.0 | 3.700000 | 12.041667 | 21.733333 | 4831.216667 |
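The `query()` filtering used for round trips can be sketched on a hypothetical toy frame:

```python
import pandas as pd

# Toy trips (hypothetical values)
trips = pd.DataFrame({
    "start_station_id": [64, 64, 50],
    "end_station_id":   [64, 60, 50],
    "duration_min":     [3.0, 16.0, 500.0],
})

# query() filters rows with a boolean expression string; here: round trips
round_trips = trips.query("start_station_id == end_station_id")

# Long round trips (>= 8 hours)
long_round = round_trips.query("duration_min >= (8 * 60)")
```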
fig = px.histogram(
round_trips_df.query('duration_min<(8*60)'),
x="duration_min",
nbins=50,
marginal="box",
color="subscription_type",
    title='Histogramm für Roundtrips < 8 h')
fig.show()
round_trips_df.query('duration_min>=(8*60)').groupby("subscription_type")["duration_min"].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| subscription_type | ||||||||
| Customer | 632.0 | 1297.459599 | 1399.084513 | 480.450000 | 588.991667 | 1053.450000 | 1376.470833 | 18892.333333 |
| Subscriber | 40.0 | 1212.813750 | 1102.710294 | 483.366667 | 544.162500 | 815.216667 | 1276.075000 | 4831.216667 |
-- Customers seem to take round trips that are longer and more frequent, like excursions, while Subscribers tend to make shorter trips (errands?)
trip_customer_df.loc[trip_customer_df['city_start'] != trip_customer_df['city_end'], ['bike_id', 'rental_duration', 'lat_start', 'long_start', 'city_start', 'lat_end', 'long_end', 'city_end']]
| bike_id | rental_duration | lat_start | long_start | city_start | lat_end | long_end | city_end | |
|---|---|---|---|---|---|---|---|---|
| 41365 | 180 | 0 days 06:17:00 | 37.486076 | -122.232086 | Redwood City | 37.776615 | -122.395264 | San Francisco |
| 41366 | 64 | 0 days 06:13:00 | 37.486076 | -122.232086 | Redwood City | 37.776615 | -122.395264 | San Francisco |
| 42491 | 686 | 0 days 05:30:00 | 37.444523 | -122.163094 | Palo Alto | 37.776615 | -122.395264 | San Francisco |
| 42492 | 168 | 0 days 05:29:00 | 37.444523 | -122.163094 | Palo Alto | 37.776615 | -122.395264 | San Francisco |
| 42493 | 240 | 0 days 05:08:00 | 37.444523 | -122.163094 | Palo Alto | 37.776615 | -122.395264 | San Francisco |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 662943 | 98 | 0 days 01:41:00 | 37.394360 | -122.076714 | Mountain View | 37.348743 | -121.894714 | San Jose |
| 664011 | 41 | 0 days 02:57:00 | 37.400242 | -122.099075 | Mountain View | 37.348743 | -121.894714 | San Jose |
| 664012 | 230 | 0 days 02:54:00 | 37.400242 | -122.099075 | Mountain View | 37.348743 | -121.894714 | San Jose |
| 666920 | 13 | 0 days 03:05:00 | 37.406940 | -122.106758 | Mountain View | 37.330696 | -121.888977 | San Jose |
| 668397 | 102 | 0 days 07:37:00 | 37.390278 | -122.066551 | Mountain View | 37.332691 | -121.900085 | San Jose |
1042 rows × 8 columns
-- The data contains information on 1,042 intercity trips.
customer_trips_df.loc[:, ['duration', 'duration_min', 'duration_h']].describe()
| duration | duration_min | duration_h | |
|---|---|---|---|
| count | 1.032130e+05 | 103213.000000 | 103213.000000 |
| mean | 3.951761e+03 | 65.862689 | 1.097711 |
| std | 5.603649e+04 | 933.941497 | 15.565692 |
| min | 6.000000e+01 | 1.000000 | 0.016667 |
| 25% | 6.580000e+02 | 10.966667 | 0.182778 |
| 50% | 1.116000e+03 | 18.600000 | 0.310000 |
| 75% | 2.329000e+03 | 38.816667 | 0.646944 |
| max | 1.727040e+07 | 287840.000000 | 4797.333333 |
subscriber_trips_df.loc[:, ['duration', 'duration_min', 'duration_h']].describe()
| duration | duration_min | duration_h | |
|---|---|---|---|
| count | 5.667460e+05 | 566746.000000 | 566746.000000 |
| mean | 5.900489e+02 | 9.834148 | 0.163902 |
| std | 3.451536e+03 | 57.525599 | 0.958760 |
| min | 6.000000e+01 | 1.000000 | 0.016667 |
| 25% | 3.230000e+02 | 5.383333 | 0.089722 |
| 50% | 4.760000e+02 | 7.933333 | 0.132222 |
| 75% | 6.660000e+02 | 11.100000 | 0.185000 |
| max | 1.852590e+06 | 30876.500000 | 514.608333 |
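The maxima above (a customer trip of roughly 4,800 hours) point to forgotten or unreturned bikes. One hedged way to keep summary statistics meaningful is to trim above a high quantile or a fixed cap; a sketch with made-up durations:

```python
import pandas as pd

# Invented duration sample (minutes) with one extreme outlier.
durations = pd.Series([5, 8, 12, 35, 60, 287840], name="duration_min")

# Trips above the 99th percentile (or a fixed cap such as 24 h) are likely
# forgotten returns; trimming them keeps mean/std interpretable.
cap = durations.quantile(0.99)
trimmed = durations[durations <= cap]
print(len(trimmed), "of", len(durations), "trips kept")
```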
def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = f'percentile_{n}'
    return percentile_
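The closure can be plugged straight into `groupby(...).agg` alongside string aliases; the `__name__` assignment is what names the result columns. A small self-contained example with toy data:

```python
import numpy as np
import pandas as pd

def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = f"percentile_{n}"
    return percentile_

df = pd.DataFrame({"grp": ["a"] * 4 + ["b"] * 4,
                   "val": [1, 2, 3, 4, 10, 20, 30, 40]})
# Column names come from the functions' __name__ attributes.
out = df.groupby("grp")["val"].agg(["median", percentile(25), percentile(75)])
print(out)
```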
duration_day_df = trip_customer_df.groupby(
    ['subscription_type', 'Day_of_week_start', 'Day_of_week_num_start'],
    as_index=False, observed=True).agg(
    duration_median=('duration_min', 'median'),  # string alias; passing np.median here is deprecated
    # duration_q25=('duration_min', percentile(25)),
    # duration_q75=('duration_min', percentile(75))
).sort_values('Day_of_week_num_start')
duration_day_df.head()
| subscription_type | Day_of_week_start | Day_of_week_num_start | duration_median | |
|---|---|---|---|---|
| 1 | Subscriber | Monday | 0 | 7.900000 |
| 8 | Customer | Monday | 0 | 17.483333 |
| 5 | Subscriber | Tuesday | 1 | 7.850000 |
| 12 | Customer | Tuesday | 1 | 15.783333 |
| 6 | Subscriber | Wednesday | 2 | 7.933333 |
fig = px.line(duration_day_df, x="Day_of_week_start", y="duration_median",
title='Mediane Fahrtdauer je nach Art des Abonnements und Wochentag',
color = 'subscription_type',
labels={'duration_median':'mediane Fahrtdauer min', 'Day_of_week_start':'Wochentag', 'subscription_type' : 'Abonnementstyp'})
fig.update_traces(mode="markers+lines")
fig.show()
total_minutes = 71984434  # total row count of the status table (one snapshot per station per minute)
print(round(len(no_bikes_or_docks_df) / total_minutes * 100, 2), '%')
print(round(len(critical_bikes_or_docks_df) / total_minutes * 100, 2), '%')
1.19 %
2.44 %
-- For over 1% of the time, stations are not fully usable.
-- And for roughly 2.5% of the time, only a critical number of docks or bikes remains available at the stations.
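How `no_bikes_or_docks_df` and `critical_bikes_or_docks_df` were presumably derived (assuming the status table holds one snapshot per station per minute) can be sketched with a few toy rows:

```python
import pandas as pd

# Toy minute snapshots in the shape of the status table (values invented).
status = pd.DataFrame({
    "station_id":      [70, 70, 70, 70],
    "bikes_available": [0, 1, 5, 10],
    "docks_available": [19, 18, 14, 0],
})

# Fully unusable: either no bikes to rent or no docks to return to.
empty_or_full = status[(status.bikes_available == 0) | (status.docks_available == 0)]
# Critical: only one bike or one dock left.
critical = status[(status.bikes_available == 1) | (status.docks_available == 1)]

print(round(len(empty_or_full) / len(status) * 100, 2), "%")
```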
critical_bikes_or_docks_df['reason'] = 'critical'
no_bikes_or_docks_df['reason'] = no_bikes_or_docks_df.bikes_available.apply(lambda x: "bikes" if x == 0 else "docks")
print(no_bikes_or_docks_df.columns)
print(critical_bikes_or_docks_df.columns)
Index(['station_id', 'bikes_available', 'docks_available', 'time', 'reason'], dtype='object')
Index(['station_id', 'bikes_available', 'docks_available', 'time', 'reason'], dtype='object')
no_bikes_or_docks_gr = no_bikes_or_docks_df.groupby(
['station_id', 'reason'], as_index = False).size().sort_values('station_id')
no_bikes_or_docks_gr = no_bikes_or_docks_gr.merge(station_info_df, left_on='station_id', right_on='id')
no_bikes_or_docks_gr['time_problem_h'] = round(no_bikes_or_docks_gr['size'] / 60)
no_bikes_or_docks_gr['time_problem_h'] = no_bikes_or_docks_gr['time_problem_h'].fillna(0)
no_bikes_or_docks_gr.loc[no_bikes_or_docks_gr['station_id'] == 31, :]
| station_id | reason | size | id | name | lat | long | dock_count | city | installation_date | time_problem_h | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 41 | 31 | bikes | 4036 | 31 | San Antonio Shopping Center | 37.400443 | -122.108338 | 15 | Mountain View | 2013-12-31 | 67.0 |
| 42 | 31 | docks | 1035 | 31 | San Antonio Shopping Center | 37.400443 | -122.108338 | 15 | Mountain View | 2013-12-31 | 17.0 |
fig = px.bar(no_bikes_or_docks_gr, x='station_id', y='size', title="Volle und leere Stationen",
color = 'reason', labels={'size':'Dauer in Minuten'},
hover_data =["city", 'station_id', 'size'])
fig.show()
fig = px.scatter_mapbox(no_bikes_or_docks_gr,
lat="lat",
lon="long",
mapbox_style="carto-positron",
color="reason",
size="time_problem_h",
color_discrete_sequence=px.colors.qualitative.G10,
size_max=20,
hover_data =['station_id'])
fig.show()
station_use_df = trip_customer_df.groupby('start_station_id', as_index=False).size()
station_use_df.rename(columns={'size' : 'use_num'}, inplace=True)
len(station_use_df)
70
station_problem_df = station_problem_df.merge(station_use_df, left_on='station_id', right_on='start_station_id')
station_problem_df.drop(columns=['start_station_id'], inplace=True)
station_problem_df
| station_id | no_bikes_min | no_docks_min | use_num | |
|---|---|---|---|---|
| 0 | 2 | 1340 | 466 | 9558 |
| 1 | 3 | 1066 | 4817 | 1594 |
| 2 | 4 | 10618 | 6614 | 3861 |
| 3 | 5 | 1321 | 0 | 1257 |
| 4 | 6 | 3564 | 3633 | 2917 |
| ... | ... | ... | ... | ... |
| 65 | 77 | 4436 | 3129 | 24172 |
| 66 | 80 | 2105 | 5 | 863 |
| 67 | 82 | 17824 | 3738 | 10310 |
| 68 | 83 | 808 | 0 | 341 |
| 69 | 84 | 3642 | 1744 | 1747 |
70 rows × 4 columns
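The default inner merge with `station_use_df` keeps only stations that appear at least once as a start station; here all 70 survive, but any unused station would be dropped silently. A left merge with `indicator=True` surfaces such rows (toy frames, station IDs invented):

```python
import pandas as pd

# Toy frames: station 3 has no recorded trips.
stations = pd.DataFrame({"station_id": [2, 3, 4]})
use = pd.DataFrame({"start_station_id": [2, 4], "use_num": [10, 5]})

# indicator=True adds a _merge column telling which side each row came from.
merged = stations.merge(use, left_on="station_id", right_on="start_station_id",
                        how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", "station_id"].tolist()
print(missing)
```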
station_problem_df = station_problem_df.merge(
station_info_df.loc[:, ['id', 'lat', 'long', 'name', 'city', 'dock_count']],
left_on='station_id',
right_on='id')
station_problem_df.drop(columns=['id'], inplace=True)
station_problem_df
| station_id | no_bikes_min | no_docks_min | use_num | lat | long | name | city | dock_count | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1340 | 466 | 9558 | 37.329732 | -121.901782 | San Jose Diridon Caltrain Station | San Jose | 27 |
| 1 | 3 | 1066 | 4817 | 1594 | 37.330698 | -121.888979 | San Jose Civic Center | San Jose | 15 |
| 2 | 4 | 10618 | 6614 | 3861 | 37.333988 | -121.894902 | Santa Clara at Almaden | San Jose | 11 |
| 3 | 5 | 1321 | 0 | 1257 | 37.331415 | -121.893200 | Adobe on Almaden | San Jose | 19 |
| 4 | 6 | 3564 | 3633 | 2917 | 37.336721 | -121.894074 | San Pedro Square | San Jose | 15 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 65 | 77 | 4436 | 3129 | 24172 | 37.789625 | -122.400811 | Market at Sansome | San Francisco | 27 |
| 66 | 80 | 2105 | 5 | 863 | 37.352601 | -121.905733 | Santa Clara County Civic Center | San Jose | 15 |
| 67 | 82 | 17824 | 3738 | 10310 | 37.798541 | -122.400862 | Broadway St at Battery St | San Francisco | 15 |
| 68 | 83 | 808 | 0 | 341 | 37.491269 | -122.236234 | Mezes Park | Redwood City | 15 |
| 69 | 84 | 3642 | 1744 | 1747 | 37.342725 | -121.895617 | Ryland Park | San Jose | 15 |
70 rows × 9 columns
station_problem_df['no_bikes_h'] = station_problem_df['no_bikes_min'] / 60
station_problem_df['no_docks_h'] = station_problem_df['no_docks_min'] / 60
fig = px.scatter(station_problem_df, x="no_bikes_h", y="no_docks_h", color='city',
hover_data=["station_id", "name", 'dock_count'],
size="use_num",
labels={'no_bikes_h': 'Ohne Fahrräder (h)', 'no_docks_h': 'Ohne freie Dockstationen (h)', 'city': 'Stadt'},
title='Zeit ohne Fahrräder vs. ohne freie Docks je Station')
fig.show()
df_sf_16_april_2014.sort_values(by=['station_id', 'time_of_day'], ascending=True)
| station_id | bikes_available | docks_available | time | lat | long | time_of_day | hour | |
|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 14 | 5 | 2014-04-16 00:00:03 | 37.783871 | -122.408433 | 00:00:03 | 0 |
| 1 | 39 | 14 | 5 | 2014-04-16 00:01:03 | 37.783871 | -122.408433 | 00:01:03 | 0 |
| 2 | 39 | 14 | 5 | 2014-04-16 00:02:03 | 37.783871 | -122.408433 | 00:02:03 | 0 |
| 3 | 39 | 14 | 5 | 2014-04-16 00:03:02 | 37.783871 | -122.408433 | 00:03:02 | 0 |
| 4 | 39 | 14 | 5 | 2014-04-16 00:04:02 | 37.783871 | -122.408433 | 00:04:02 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 50395 | 82 | 6 | 9 | 2014-04-16 23:55:03 | 37.798541 | -122.400862 | 23:55:03 | 23 |
| 50396 | 82 | 6 | 9 | 2014-04-16 23:56:02 | 37.798541 | -122.400862 | 23:56:02 | 23 |
| 50397 | 82 | 6 | 9 | 2014-04-16 23:57:03 | 37.798541 | -122.400862 | 23:57:03 | 23 |
| 50398 | 82 | 6 | 9 | 2014-04-16 23:58:03 | 37.798541 | -122.400862 | 23:58:03 | 23 |
| 50399 | 82 | 6 | 9 | 2014-04-16 23:59:03 | 37.798541 | -122.400862 | 23:59:03 | 23 |
50400 rows × 8 columns
type(df_sf_16_april_2014['time_of_day'][0])
datetime.time
df_sf_16_april_2014['time_of_day_str'] = df_sf_16_april_2014['time_of_day'].astype(str)
fig = px.scatter_mapbox(
df_sf_16_april_2014.loc[::8,:],
lat="lat", lon="long", animation_frame="time_of_day_str", animation_group="station_id",
size="bikes_available",
color='bikes_available',
mapbox_style="carto-positron", range_color=(0,28),
center={"lat": 37.785, "lon": -122.405}, zoom=11.8, size_max=25,
labels={'time_of_day_str':'Zeit', 'bikes_available': 'Fahrräder verfügbar'},
title='San Francisco 16.04.2014'
)
fig.show()
station_70_df
| station_id | bikes_available | docks_available | time | |
|---|---|---|---|---|
| 0 | 70 | 0 | 19 | 2013/08/29 12:06:01 |
| 1 | 70 | 0 | 19 | 2013/08/29 12:07:01 |
| 2 | 70 | 0 | 19 | 2013/08/29 12:08:01 |
| 3 | 70 | 0 | 19 | 2013/08/29 12:09:01 |
| 4 | 70 | 0 | 19 | 2013/08/29 12:10:01 |
| ... | ... | ... | ... | ... |
| 1047135 | 70 | 15 | 4 | 2015-08-31 23:55:02 |
| 1047136 | 70 | 15 | 4 | 2015-08-31 23:56:01 |
| 1047137 | 70 | 15 | 4 | 2015-08-31 23:57:02 |
| 1047138 | 70 | 15 | 4 | 2015-08-31 23:58:02 |
| 1047139 | 70 | 15 | 4 | 2015-08-31 23:59:02 |
1047140 rows × 4 columns
st_70_no_bikes = station_70_df.loc[station_70_df.bikes_available == 0].shape[0]
st_70_no_docks = station_70_df.loc[station_70_df.docks_available == 0].shape[0]
st_70_critical_bikes = station_70_df.loc[station_70_df.bikes_available == 1].shape[0]
st_70_critical_docks = station_70_df.loc[station_70_df.docks_available == 1].shape[0]
st_70_used_num = station_70_df.shape[0] - st_70_no_bikes - st_70_no_docks - st_70_critical_bikes - st_70_critical_docks
d = {
'param': ['no_bikes', 'no_docks', 'critical_bikes', 'critical_docks', 'ok'],
'val': [st_70_no_bikes, st_70_no_docks, st_70_critical_bikes, st_70_critical_docks, st_70_used_num]
}
station_70_sum_df = pd.DataFrame(data=d)
station_70_sum_df
| param | val | |
|---|---|---|
| 0 | no_bikes | 21726 |
| 1 | no_docks | 19422 |
| 2 | critical_bikes | 22596 |
| 3 | critical_docks | 49381 |
| 4 | ok | 934015 |
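The four separate masks above can be collapsed into one classification with `np.select`; the condition order decides which label wins when a row would match several (toy snapshots, values invented):

```python
import numpy as np
import pandas as pd

status = pd.DataFrame({
    "bikes_available": [0, 1, 5, 10, 18],
    "docks_available": [19, 18, 14, 9, 1],
})

# np.select evaluates conditions in order, so "no_bikes" takes precedence
# over "critical_docks" when both would match a row.
conditions = [
    status.bikes_available == 0,
    status.docks_available == 0,
    status.bikes_available == 1,
    status.docks_available == 1,
]
labels = ["no_bikes", "no_docks", "critical_bikes", "critical_docks"]
status["param"] = np.select(conditions, labels, default="ok")
print(status["param"].value_counts().to_dict())
```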
fig = px.pie(station_70_sum_df, values = 'val', names='param', title= 'Verfügbarkeit des Services')
fig.show()
station_70_df['time'] = pd.to_datetime(station_70_df['time'])
station_70_df['Day_of_week'] = station_70_df['time'].dt.day_name()
station_70_df['Day_of_week_num'] = station_70_df['time'].dt.day_of_week
station_70_df['time_h'] = station_70_df['time'].dt.hour
station_70_df
| station_id | bikes_available | docks_available | time | Day_of_week | Day_of_week_num | time_h | |
|---|---|---|---|---|---|---|---|
| 0 | 70 | 0 | 19 | 2013-08-29 12:06:01 | Thursday | 3 | 12 |
| 1 | 70 | 0 | 19 | 2013-08-29 12:07:01 | Thursday | 3 | 12 |
| 2 | 70 | 0 | 19 | 2013-08-29 12:08:01 | Thursday | 3 | 12 |
| 3 | 70 | 0 | 19 | 2013-08-29 12:09:01 | Thursday | 3 | 12 |
| 4 | 70 | 0 | 19 | 2013-08-29 12:10:01 | Thursday | 3 | 12 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1047135 | 70 | 15 | 4 | 2015-08-31 23:55:02 | Monday | 0 | 23 |
| 1047136 | 70 | 15 | 4 | 2015-08-31 23:56:01 | Monday | 0 | 23 |
| 1047137 | 70 | 15 | 4 | 2015-08-31 23:57:02 | Monday | 0 | 23 |
| 1047138 | 70 | 15 | 4 | 2015-08-31 23:58:02 | Monday | 0 | 23 |
| 1047139 | 70 | 15 | 4 | 2015-08-31 23:59:02 | Monday | 0 | 23 |
1047140 rows × 7 columns
station_70_df_gr = station_70_df.groupby(['station_id', 'Day_of_week', 'time_h'], as_index=False).agg(
    bikes_available_mean=('bikes_available', 'mean'),  # string alias; passing np.mean here is deprecated
    docks_available_mean=('docks_available', 'mean'))
station_70_df_gr
| station_id | Day_of_week | time_h | bikes_available_mean | docks_available_mean | |
|---|---|---|---|---|---|
| 0 | 70 | Friday | 0 | 14.501036 | 4.489399 |
| 1 | 70 | Friday | 1 | 14.524098 | 4.466326 |
| 2 | 70 | Friday | 2 | 14.525052 | 4.465405 |
| 3 | 70 | Friday | 3 | 14.510987 | 4.479459 |
| 4 | 70 | Friday | 4 | 14.524908 | 4.465542 |
| ... | ... | ... | ... | ... | ... |
| 163 | 70 | Wednesday | 19 | 12.299020 | 6.570259 |
| 164 | 70 | Wednesday | 20 | 12.455159 | 6.459820 |
| 165 | 70 | Wednesday | 21 | 13.679801 | 5.307977 |
| 166 | 70 | Wednesday | 22 | 13.903775 | 5.087711 |
| 167 | 70 | Wednesday | 23 | 13.746343 | 5.244012 |
168 rows × 5 columns
fig = px.line(station_70_df_gr,
x='time_h',
y='bikes_available_mean',
color='Day_of_week',
labels={'bikes_available_mean':'Anzahl verfügbarer Fahrräder',
'Day_of_week':'Wochentag',
'time_h': 'Zeit'},
title='Durchschnittliche Anzahl der Fahrräder, die an der Station 70 während der Woche zur Verfügung stehen',
hover_name='Day_of_week')
fig.show()
fig = px.line(station_70_df_gr,
x='time_h',
y='docks_available_mean',
color='Day_of_week',
labels={'docks_available_mean':'Anzahl verfügbarer Dockstationen',
'Day_of_week':'Wochentag',
'time_h': 'Zeit'},
title='Durchschnittliche Anzahl der Dockstationen, die an der Station 70 während der Woche zur Verfügung stehen',
hover_name='Day_of_week')
fig.show()
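Because the groupby output sorts weekdays alphabetically, the traces in the two charts above start with Friday. An ordered `Categorical` (or plotly's `category_orders` argument) restores calendar order; a sketch with toy data:

```python
import pandas as pd

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]
df = pd.DataFrame({"Day_of_week": ["Friday", "Monday", "Sunday"],
                   "v": [1, 2, 3]})

# An ordered Categorical makes sort_values/groupby emit weekdays in calendar
# order, which also fixes the trace order in the plotly line charts.
df["Day_of_week"] = pd.Categorical(df["Day_of_week"],
                                   categories=weekday_order, ordered=True)
print(df.sort_values("Day_of_week")["Day_of_week"].tolist())
```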
station_status_df.head()
| station_id | avg_bikes_available | avg_docks_available | day_of_week | h | |
|---|---|---|---|---|---|
| 0 | 2 | 13.675880 | 13.295195 | 0 | 00 |
| 1 | 2 | 13.610708 | 13.360264 | 0 | 01 |
| 2 | 2 | 13.417803 | 13.552636 | 0 | 02 |
| 3 | 2 | 13.505408 | 13.465537 | 0 | 03 |
| 4 | 2 | 13.496455 | 13.474541 | 0 | 04 |
station_status_df['day_of_week'] = station_status_df['day_of_week'].astype(int)
station_status_df['h'] = station_status_df['h'].astype(int)
# station_status_df
station_status_df["day_type"] = station_status_df.day_of_week.apply(lambda x: "Working" if x < 5 else "Weekend")
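`np.where` is the vectorized equivalent of the row-wise `apply` used for `day_type`; a quick sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"day_of_week": [0, 4, 5, 6]})

# day_of_week 0-4 = Monday-Friday, 5-6 = weekend; np.where avoids the
# per-row lambda and scales better on large status tables.
df["day_type"] = np.where(df["day_of_week"] < 5, "Working", "Weekend")
print(df["day_type"].tolist())
```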
station_status_df = station_status_df.merge(
station_info_df.loc[:, ['id', 'city', 'lat', 'long', 'name']],
left_on='station_id', right_on='id')
station_status_df
| station_id | avg_bikes_available | avg_docks_available | day_of_week | h | day_type | id | city | lat | long | name | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 13.675880 | 13.295195 | 0 | 0 | Working | 2 | San Jose | 37.329732 | -121.901782 | San Jose Diridon Caltrain Station |
| 1 | 2 | 13.610708 | 13.360264 | 0 | 1 | Working | 2 | San Jose | 37.329732 | -121.901782 | San Jose Diridon Caltrain Station |
| 2 | 2 | 13.417803 | 13.552636 | 0 | 2 | Working | 2 | San Jose | 37.329732 | -121.901782 | San Jose Diridon Caltrain Station |
| 3 | 2 | 13.505408 | 13.465537 | 0 | 3 | Working | 2 | San Jose | 37.329732 | -121.901782 | San Jose Diridon Caltrain Station |
| 4 | 2 | 13.496455 | 13.474541 | 0 | 4 | Working | 2 | San Jose | 37.329732 | -121.901782 | San Jose Diridon Caltrain Station |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11755 | 84 | 7.027626 | 7.958676 | 6 | 19 | Weekend | 84 | San Jose | 37.342725 | -121.895617 | Ryland Park |
| 11756 | 84 | 7.094770 | 7.891528 | 6 | 20 | Weekend | 84 | San Jose | 37.342725 | -121.895617 | Ryland Park |
| 11757 | 84 | 7.171005 | 7.815297 | 6 | 21 | Weekend | 84 | San Jose | 37.342725 | -121.895617 | Ryland Park |
| 11758 | 84 | 7.242466 | 7.743836 | 6 | 22 | Weekend | 84 | San Jose | 37.342725 | -121.895617 | Ryland Park |
| 11759 | 84 | 7.290183 | 7.696119 | 6 | 23 | Weekend | 84 | San Jose | 37.342725 | -121.895617 | Ryland Park |
11760 rows × 11 columns
station_status_df_gr = station_status_df.groupby(
    ['station_id', 'h', 'day_type', 'city', 'lat', 'long'],
    as_index=False).agg(
    avg_bikes_available=('avg_bikes_available', 'mean'),  # string alias; np.mean is deprecated here
    avg_docks_available=('avg_docks_available', 'mean'))  # note: averages the per-weekday means, weighting each weekday equally
station_status_df.avg_bikes_available.max()
18.846276257161044
fig_station_dev = px.scatter_mapbox(data_frame=station_status_df_gr.loc[station_status_df_gr.day_type == 'Working', :],
title="Verfügbare Fahrräder im Tagesverlauf an Werktagen",
lat='lat',
lon='long',
size="avg_bikes_available",
color="avg_bikes_available",
color_continuous_scale="greens",
range_color=(0,19),
size_max=25,
animation_frame="h",
mapbox_style="carto-positron",
# width=1000,
height=850)
fig_station_dev.show()